"AudioSpectrogram" (Net Encoder)

NetEncoder["AudioSpectrogram"]

represents an encoder that converts an audio file or object into its spectrogram.

NetEncoder[{"AudioSpectrogram","param"->val,…}]

represents an encoder with specific parameters for preprocessing and feature computation.

Details

  • The "AudioSpectrogram" encoder computes the spectrogram of a signal and discards some redundant information contained in the short-time Fourier transform. It also discards the phase information, which means that an exact reconstruction of the original signal is not possible.
  • NetEncoder[…][input] applies the encoder to an input to produce a "Real32" NumericArray.
  • NetEncoder[…][{input1,input2,…}] applies the encoder to a list of inputs to produce a list of NumericArray objects.
  • The input to the encoder can be an Audio object or a File[…] expression.
  • The output of the encoder is a rank-2 tensor of dimensions {n,Floor[(ws/2.)+1]}, where n is the number of partitions after the preprocessing is applied and ws is the length of the partitions used for the computation.
  • An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
  • Parameters
  • The following general parameters are supported:
  • "Augmentation"     None     augmentation to be applied
    "Normalization"    None     whether to apply normalization
    "SampleRate"       16000    target sample rate
    "TargetLength"     All      target output length
  • Additional partitioning parameters:
  • "WindowSize"        Automatic    length of the partitions
    "Offset"            Automatic    offset of the partitions
    "WindowFunction"    Automatic    window to be applied to the partitions
  • The following settings and suboptions can be specified for each encoder parameter.
  • "Normalization" can take the following settings:
  • None             no normalization
    "Max"            absolute maximum value normalized to 1
    {"Max",val}      absolute maximum value normalized to val
    {"RMS",val}      RMS of input audio signal normalized to val
  • "TargetLength" can take the following settings:
  • All    same as input signal
    dur    the duration dur specified as a time quantity
    n      the first n partitions
  • If the specified "TargetLength" does not match the length of the input signal, padding or trimming is applied accordingly.
  • "Augmentation" can be specified as a list of rules with the following keys:
  • "Convolution"    None    convolves an impulse response with the input
    "Noise"          None    adds noise to the input
    "TimeShift"      None    shifts the input by a specified amount
    "Volume"         None    multiplies the input by a constant
  • Any augmentation parameter that accepts a numeric value can also be specified as a list of two numbers or a univariate distribution. In the first case, the value will be randomized according to a uniform distribution between the given bounds. In the second, the user-provided distribution will be used.
  • Possible values for "Convolution" include:
  • None            no augmentation
    signal          File or Audio object to be convolved with input
    {mix,signal}    signal to be convolved with input and mix parameter
  • Possible values for "Noise" include:
  • None           no augmentation
    amp            white noise with amplitude amp
    noise          File or Audio object containing the noise signal to be added
    {amp,noise}    noise signal noise added with the specified amplitude amp
  • Use "TimeShift"->t to shift the input by t seconds, padding or trimming if necessary. Use Scaled[s] to shift the input by s×dur seconds, where dur is the duration of the input signal. Use {t1,t2} or Scaled[{s1,s2}] to randomize the shift between the specified bounds.
  • Use "Volume"->val to specify a constant multiplier.
  • With the parameter "WindowSize"->Automatic, a partition length of 25 milliseconds is used. Use "WindowSize"->dur to select a partition length of duration dur. Use "WindowSize"->n to select a partition length of n samples.
  • With the parameter "Offset"->Automatic, a partition offset of 8.33 milliseconds is used. Use "Offset"->dur to select a partition offset of duration dur. Use "Offset"->n to select a partition offset of n samples.
  • Parameter "WindowFunction" applies a window to each partition. Possible settings are:
  • None         no windowing applied to the input audio
    Automatic
    func         the window is computed using the function func
    list         the sampled window list is explicitly specified
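
As a minimal sketch of how the parameters above can be combined in one encoder: the specific values, the choice of HannWindow as the window function, and the generated test tone are illustrative assumptions, not recommended defaults.

```wolfram
(* All parameter values below are assumptions chosen for illustration *)
enc = NetEncoder[{"AudioSpectrogram",
    "SampleRate" -> 16000,
    "WindowSize" -> 512,               (* partition length in samples *)
    "Offset" -> 128,                   (* partition offset in samples *)
    "WindowFunction" -> HannWindow,    (* window given as a function *)
    "Normalization" -> {"Max", 0.5},   (* absolute maximum normalized to 0.5 *)
    "TargetLength" -> 100,             (* pad or trim to 100 partitions *)
    "Augmentation" -> {
      "Volume" -> {0.5, 1.},           (* multiplier drawn uniformly from this range *)
      "Noise" -> 0.01                  (* white noise with amplitude 0.01 *)
    }}];

(* A one-second 440 Hz test tone stands in for real input *)
audio = AudioGenerator[{"Sin", 440}, 1];

Dimensions[enc[audio]]
```

By the dimension rule stated earlier, the result should have 100 rows and Floor[512/2+1] columns.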

Examples


Basic Examples  (1)

Create a spectrogram NetEncoder:

Create an Audio object:

Apply the encoder to the Audio object:

Plot the result:
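
A minimal sketch of the four steps above, assuming a generated sine tone as the input and MatrixPlot as the visualization (both are illustrative choices):

```wolfram
(* Create a spectrogram encoder with default parameters *)
enc = NetEncoder["AudioSpectrogram"];

(* Create an Audio object: a one-second 440 Hz tone (assumed test input) *)
audio = AudioGenerator[{"Sin", 440}, 1];

(* Apply the encoder; the result is a rank-2 NumericArray *)
spec = enc[audio];
Dimensions[spec]

(* Plot the result; Normal converts the NumericArray to nested lists *)
MatrixPlot[Normal[spec]]
```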

Scope  (3)

NetEncoder["AudioSpectrogram"] can encode either File or Audio objects. Create a spectrogram encoder:

Apply the encoder to a File object:

Apply the encoder to an in-core Audio object:

Apply the encoder to an out-of-core Audio object:

Create a list of Audio objects:

NetEncoder["AudioSpectrogram"] maps across a batch of inputs:

Create a spectrogram NetEncoder:

Attach the encoder to the input of a net:

Apply the net to an Audio object:
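
The scope cases above could be sketched as follows. The temporary file name, the generated tones, the fixed "TargetLength" of 50, and the single-layer net are all assumptions made for illustration.

```wolfram
enc = NetEncoder["AudioSpectrogram"];
audio = AudioGenerator[{"Sin", 440}, 1];

(* Encode a File: export the tone to a temporary WAV file first (hypothetical path) *)
file = Export[FileNameJoin[{$TemporaryDirectory, "tone.wav"}], audio];
Dimensions[enc[File[file]]]

(* Encode an in-core Audio object *)
Dimensions[enc[audio]]

(* Encode a batch: a list of inputs gives a list of arrays *)
audios = Table[AudioGenerator[{"Sin", f}, 1], {f, {220, 440, 880}}];
Dimensions /@ enc[audios]

(* Attach an encoder to the input port of a net; a fixed target length
   is assumed here so that the net sees a fixed-size input *)
encFixed = NetEncoder[{"AudioSpectrogram", "TargetLength" -> 50}];
net = NetChain[{AggregationLayer[Mean, 1]}, "Input" -> encFixed];

(* The net now accepts Audio objects directly *)
Dimensions[net[audio]]
```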

Parameters  (6)

"Normalization"  (1)

Create an Audio object:

Use an encoder with "Normalization"->None to avoid any normalization:

Since the normalization is applied to the signal before the spectrogram is computed, there are no guarantees on the bounds of the result:

Use an encoder with "Normalization"->Automatic to normalize the maximum absolute value of the waveform samples to 1:

Find the minimum and maximum values of the result:
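
A minimal sketch of this comparison, assuming a deliberately quiet test tone and using the "Max" setting listed in the Details section for the normalized case:

```wolfram
(* A quiet tone: the generated sine is attenuated so normalization has a visible effect *)
audio = AudioAmplify[AudioGenerator[{"Sin", 440}, 1], 0.25];

encNone = NetEncoder[{"AudioSpectrogram", "Normalization" -> None}];
encMax = NetEncoder[{"AudioSpectrogram", "Normalization" -> "Max"}];

(* Normalization acts on the waveform, not on the spectrogram,
   so the spectrogram values themselves are not bounded by 1 *)
MinMax[Normal[encNone[audio]]]
MinMax[Normal[encMax[audio]]]
```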

"SampleRate"  (2)

Create an Audio object:

Using an encoder with "SampleRate"->8000 resamples the signal to 8000 Hz before performing the short-time Fourier transform:

The "SampleRate" parameter affects the computation of the default window size:

An encoder with a lower sample rate than the original audio will result in a shorter window length:

An encoder with a higher sample rate than the original audio will result in a longer window length:
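
The effect on the default window size can be sketched by comparing output dimensions. The 16000 Hz test tone and the two target rates are assumptions chosen for illustration.

```wolfram
(* A one-second test tone sampled at 16000 Hz *)
audio = AudioGenerator[{"Sin", 440}, 1, SampleRate -> 16000];

enc8k = NetEncoder[{"AudioSpectrogram", "SampleRate" -> 8000}];
enc32k = NetEncoder[{"AudioSpectrogram", "SampleRate" -> 32000}];

(* The default 25 ms window spans fewer samples at 8000 Hz than at 32000 Hz,
   so the second dimension (number of frequency bins) differs *)
Dimensions[enc8k[audio]]
Dimensions[enc32k[audio]]
```

With a 25 ms window, the partition length should be 200 samples at 8000 Hz and 800 samples at 32000 Hz.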

"TargetLength"  (1)

Create an Audio object:

Using an encoder with "TargetLength"->All returns the spectrogram for all the data:

Using an encoder with "TargetLength"->10 zero-pads the output to be of length 10:

Using an encoder with "TargetLength"->2 takes only the first two partitions:
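
A minimal sketch of the three settings, assuming a short 50 ms test tone so that the full spectrogram has fewer than 10 partitions:

```wolfram
(* A short tone (assumed input), so that padding to 10 partitions is needed *)
audio = AudioGenerator[{"Sin", 440}, 0.05, SampleRate -> 16000];

encAll = NetEncoder[{"AudioSpectrogram", "TargetLength" -> All}];
enc10 = NetEncoder[{"AudioSpectrogram", "TargetLength" -> 10}];
enc2 = NetEncoder[{"AudioSpectrogram", "TargetLength" -> 2}];

(* Compare the first dimension (number of partitions) of each result *)
Dimensions[encAll[audio]]
Dimensions[enc10[audio]]
Dimensions[enc2[audio]]
```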

"WindowSize"  (1)

Create an Audio object:

The partition length is automatically computed to be 25 ms:

Using an encoder with "WindowSize"->600 returns the spectrogram using partitions of 600 samples:
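
A minimal sketch of both cases, assuming a generated test tone as the input:

```wolfram
audio = AudioGenerator[{"Sin", 440}, 1];

(* Default: 25 ms partitions at the default sample rate of 16000 *)
encDefault = NetEncoder["AudioSpectrogram"];

(* Explicit partition length of 600 samples *)
enc600 = NetEncoder[{"AudioSpectrogram", "WindowSize" -> 600}];

(* The second dimension follows Floor[windowSize/2 + 1] *)
Dimensions[encDefault[audio]]
Dimensions[enc600[audio]]
```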

"Offset"  (1)

Create an Audio object:

The partition offset is automatically computed to be 1/3 of the partition length:

Using an encoder with "Offset"->10 returns the spectrogram computed using partitions with an offset of 10 samples:
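
A minimal sketch of both cases, assuming a generated test tone as the input:

```wolfram
audio = AudioGenerator[{"Sin", 440}, 1];

(* Default offset: one third of the partition length *)
encDefault = NetEncoder["AudioSpectrogram"];

(* Explicit offset of 10 samples: many more, heavily overlapping partitions *)
enc10 = NetEncoder[{"AudioSpectrogram", "Offset" -> 10}];

(* Only the first dimension (number of partitions) changes *)
Dimensions[encDefault[audio]]
Dimensions[enc10[audio]]
```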

Properties & Relations  (2)

Create an Audio object:

Create a spectrogram NetEncoder:

The length of the result can be computed as Ceiling[length/offset], where length is the length of the signal after resampling and offset is the "Offset" parameter of the encoder:

Create an Audio object:

Create a spectrogram NetEncoder:

The second dimension of the result can be computed as Floor[windowSize/2+1], where windowSize is the "WindowSize" parameter of the encoder:
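
Both relations can be checked with a short sketch. The test tone and the explicit "WindowSize" and "Offset" values are assumptions chosen so that the arithmetic is easy to follow.

```wolfram
(* One second at 16000 Hz, so no resampling is needed and length = 16000 samples *)
audio = AudioGenerator[{"Sin", 440}, 1, SampleRate -> 16000];

windowSize = 400;
offset = 160;
enc = NetEncoder[{"AudioSpectrogram",
    "SampleRate" -> 16000,
    "WindowSize" -> windowSize,
    "Offset" -> offset}];

(* Dimensions produced by the encoder *)
Dimensions[enc[audio]]

(* Dimensions predicted by the two relations above *)
{Ceiling[16000/offset], Floor[windowSize/2 + 1]}
```

The second expression evaluates to {100, 201}; per the relations above, the encoder output is expected to have the same dimensions.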