"AudioMelSpectrogram" (Net Encoder)

NetEncoder["AudioMelSpectrogram"]

represents an encoder that converts an audio file or object into its mel-frequency spectrogram.

NetEncoder[{"AudioMelSpectrogram","param"->val,…}]

represents an encoder with specific parameters for preprocessing and feature computation.

Details

  • The "AudioMelSpectrogram" encoder computes the magnitude spectrogram and applies a filter bank to it whose filter centers are linearly spaced on the mel-frequency scale. This mimics the human perception of pitch, which is nonlinear. The number of filters is always less than the number of spectrogram bins, so the dimensionality of the feature is reduced.
  • NetEncoder[…][input] applies the encoder to an input to produce an output.
  • NetEncoder[…][{input1,input2,…}] applies the encoder to a list of inputs to produce a list of outputs.
  • The input to the encoder can be an Audio object or a File[…] expression.
  • The output is computed by filtering the spectrogram with nf bandpass filters whose center frequencies are linearly spaced on the mel scale.
  • The output of the encoder is a rank-2 tensor of dimensions {n,nf}, where n is the number of partitions after the preprocessing is applied and nf is the number of filters used for the computation.
  • An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
  • Parameters
  • The following general parameters are supported:
    "Augmentation"     None    augmentation to be applied
    "Normalization"    None    whether to apply normalization
    "SampleRate"       16000   target sample rate
    "TargetLength"     All     target output length
  • Additional partitioning parameters:
    "WindowSize"        Automatic    length of the partitions
    "Offset"            Automatic    offset of the partitions
    "WindowFunction"    Automatic    window to be applied to the partitions
  • Mel-spectrogram parameters:
    "MaximumFrequency"    Automatic    maximum frequency of the mel filters
    "MinimumFrequency"    Automatic    minimum frequency of the mel filters
    "NumberOfFilters"     40           number of mel filters
  • The following settings and suboptions can be specified for each encoder parameter.
  • "Normalization" can take the following settings:
    None           no normalization
    "Max"          absolute maximum value normalized to 1
    {"Max",val}    absolute maximum value normalized to val
    {"RMS",val}    RMS of input audio signal normalized to val
  • "TargetLength" can take the following settings:
    All    same as input signal
    dur    the duration dur specified as a time quantity
    n      the first n partitions
  • If the specified "TargetLength" does not match the length of the input signal, padding or trimming is applied accordingly.
  • "Augmentation" can be specified as a list of rules with the following keys:
    "Convolution"    None    convolves an impulse response with the input
    "Noise"          None    adds noise to the input
    "TimeShift"      None    shifts the input by a specified amount
    "Volume"         None    multiplies the input by a constant
    "VTLP"           None    applies vocal tract length perturbation to the input
  • Any augmentation parameter that accepts a numeric value can also be specified as a list of two numbers or a univariate distribution. In the first case, the value will be randomized according to a uniform distribution between the given bounds. In the second, the user-provided distribution will be used.
  • Possible values for "Convolution" include:
    None            no augmentation
    signal          File or Audio object to be convolved with the input
    {mix,signal}    signal to be convolved with the input, together with the mix parameter mix
  • Possible values for "Noise" include:
    None           no augmentation
    amp            white noise with amplitude amp
    noise          File or Audio object containing the noise signal to be added
    {amp,noise}    noise signal noise added with the specified amplitude amp
  • Use "TimeShift"->t to shift the input by t seconds, padding or trimming if necessary. Use Scaled[s] to shift the input by s×dur seconds, where dur is the duration of the input signal. Use {t1,t2} or Scaled[{s1,s2}] to randomize the shift between the specified bounds.
  • Use "Volume"->val to specify a constant multiplier.
  • Vocal tract length perturbation (VTLP) multiplies the center values of the filter frequencies in the mel-spectrogram by a fixed amount. Use "VTLP"->val to specify the amount of the perturbation.
  • With the parameter "WindowSize"->Automatic, a partition length of 25 milliseconds is used. Use "WindowSize"->dur to select a partition length of duration dur. Use "WindowSize"->n to select a partition length of n samples.
  • With the parameter "Offset"->Automatic, a partition offset of 8.33 milliseconds is used. Use "Offset"->dur to select a partition offset of duration dur. Use "Offset"->n to select a partition offset of n samples.
  • The parameter "WindowFunction" specifies the window applied to each partition. Possible settings are:
    None         no windowing applied to the input audio
    Automatic
    func         the window is computed using the function func
    list         the sampled window list is explicitly specified
  • With the parameter "MinimumFrequency"->Automatic, the minimum frequency is computed as Ceiling[sr/ws], where sr is the sample rate "SampleRate" and ws is the partition length "WindowSize". Use "MinimumFrequency"->f to set the minimum frequency for the filters to f.
  • With the parameter "MaximumFrequency"->Automatic, the maximum frequency is computed as Round[Min[8000,sr/2]], where sr is the sample rate "SampleRate". Use "MaximumFrequency"->f to set the maximum frequency for the filters to f.
  • With the parameter "NumberOfFilters"->n, n filters will be used in the computation of the mel-spectrogram.
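
As an illustration of how the parameters above combine, the following is a minimal sketch of an encoder with explicit partitioning, filter-bank and augmentation settings. All values are arbitrary choices for demonstration rather than recommended settings, and the arithmetic at the end assumes that ws in the "MinimumFrequency" formula is measured in samples.

```wolfram
(* Encoder with explicit settings; every value here is illustrative *)
enc = NetEncoder[{"AudioMelSpectrogram",
    "SampleRate" -> 16000,
    "WindowSize" -> Quantity[25, "Milliseconds"], (* partition length as a duration *)
    "Offset" -> 160,                              (* partition offset in samples *)
    "NumberOfFilters" -> 64,
    "TargetLength" -> 100,                        (* keep the first 100 partitions *)
    "Normalization" -> "Max",
    "Augmentation" -> {
      "Volume" -> {0.5, 1.5},           (* multiplier drawn uniformly from the range *)
      "TimeShift" -> Scaled[{0, 0.1}]   (* random shift as a fraction of the duration *)
    }}];

(* Automatic frequency bounds for sr = 16000 and a 25 ms window,
   assuming ws is counted in samples (400 samples at 16 kHz) *)
sr = 16000; ws = 400;
Ceiling[sr/ws]           (* minimum frequency: 40 *)
Round[Min[8000, sr/2]]   (* maximum frequency: 8000 *)
```

With "TargetLength"->100 and "NumberOfFilters"->64, the description of the output given above implies a result of dimensions {100,64}.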

Examples

Basic Examples  (1)

Create a mel-spectrogram NetEncoder:

In[1]:=
Out[1]=

Create an Audio object:

In[2]:=
Out[2]=

Apply the encoder to the Audio object:

In[3]:=
Out[3]//Short=

Plot the result:

In[4]:=
Out[4]=
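
The contents of the input cells above are not preserved here. The following is a minimal sketch of the same four steps, assuming default encoder parameters; the 440 Hz test tone and the use of MatrixPlot are illustrative choices, not necessarily those of the original example.

```wolfram
(* Step 1: create the encoder with default parameters *)
enc = NetEncoder["AudioMelSpectrogram"]

(* Step 2: create an Audio object; a one-second sine tone is an arbitrary test signal *)
audio = AudioGenerator[{"Sin", 440}, 1]

(* Step 3: apply the encoder; the result has one row per partition
   and one column per mel filter *)
features = enc[audio];
Dimensions[features]

(* Step 4: plot the result with time along the horizontal axis;
   Normal is a precaution in case the result is not an ordinary list *)
MatrixPlot[Transpose[Normal[features]]]
```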

Scope  (3)

Parameters  (9)

Properties & Relations  (1)