"AudioMFCC" (Net Encoder)

NetEncoder["AudioMFCC"]

represents an encoder that converts an audio file or object into its mel-frequency cepstral coefficients.

NetEncoder[{"AudioMFCC","param"->val,…}]

represents an encoder with specific parameters for preprocessing and feature computation.

Details

  • The "AudioMFCC" encoder computes the FourierDCT of the logarithm of each frame of the mel-spectrogram and keeps only the first few coefficients. Mel-frequency cepstral coefficients (MFCC) greatly reduce the dimensionality of the feature while preserving much of the information contained in the original signal, especially in the case of speech.
  • NetEncoder[…][input] applies the encoder to an input to produce a "Real32" output.
  • NetEncoder[…][{input1,input2,…}] applies the encoder to a list of inputs to produce a list of outputs.
  • When given a NumericArray as input, the output will be a NumericArray.
  • The input to the encoder can be an Audio object or a File[…] expression.
  • The output is computed by applying a discrete cosine transform to the mel-spectrogram, keeping only the first nc coefficients.
  • The output of the encoder is a rank-2 tensor of dimensions {n,nc}, where n is the number of partitions after the preprocessing is applied and nc is the number of coefficients used for the computation.
  • An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
  • Parameters
  • The following general parameters are supported:
  • "Augmentation"     None     augmentation to be applied
    "Normalization"    None     whether to apply normalization
    "SampleRate"       16000    target sample rate
    "TargetLength"     All      target output length
  • Additional partitioning parameters:
  • "WindowSize"        Automatic    length of the partitions
    "Offset"            Automatic    offset of the partitions
    "WindowFunction"    Automatic    window to be applied to the partitions
  • Mel-spectrogram parameters:
  • "MaximumFrequency"    Automatic    maximum frequency of the mel filters
    "MinimumFrequency"    Automatic    minimum frequency of the mel filters
    "NumberOfFilters"     40           number of mel filters
  • MFCC parameter:
  • "NumberOfCoefficients"    13    number of coefficients
  • The following settings and suboptions can be specified for each encoder parameter.
  • "Normalization" can take the following settings:
  • None           no normalization
    "Max"          absolute maximum value normalized to 1
    {"Max",val}    absolute maximum value normalized to val
    {"RMS",val}    RMS of input audio signal normalized to val
  • "TargetLength" can take the following settings:
  • All    same as input signal
    dur    the duration dur specified as a time quantity
    n      the first n partitions
  • If the specified "TargetLength" does not match the length of the input signal, padding or trimming is applied accordingly.
  • "Augmentation" can be specified as a list of rules with the following keys:
  • "Convolution"    None    convolves an impulse response with the input
    "Noise"          None    adds noise to the input
    "TimeShift"      None    shifts the input by a specified amount
    "Volume"         None    multiplies the input by a constant
    "VTLP"           None    applies vocal tract length perturbation to the input
  • Any augmentation parameter that accepts a numeric value can also be specified as a list of two numbers or a univariate distribution. In the first case, the value will be randomized according to a uniform distribution between the given bounds. In the second, the user-provided distribution will be used.
  • Possible values for "Convolution" include:
  • None            no augmentation
    signal          File or Audio object to be convolved with input
    {mix,signal}    signal to be convolved with input and mix parameter
  • Possible values for "Noise" include:
  • None           no augmentation
    amp            white noise with amplitude amp
    noise          File or Audio object containing the noise signal to be added
    {amp,noise}    noise signal noise added with the specified amplitude amp
  • Use "TimeShift"->t to shift the input by t seconds, padding or trimming if necessary. Use Scaled[s] to shift the input by s×dur seconds, where dur is the duration of the input signal. Use {t1,t2} or Scaled[{s1,s2}] to randomize the shift between the specified bounds.
  • Use "Volume"->val to specify a constant multiplier.
  • Vocal tract length perturbation (VTLP) multiplies the center values of the filter frequencies in the mel-spectrogram by a fixed amount. Use "VTLP"->val to specify the amount of the perturbation.
  • With the parameter "WindowSize"->Automatic, a partition length of 25 milliseconds is used. Use "WindowSize"->dur to select a partition length of duration dur. Use "WindowSize"->n to select a partition length of n samples.
  • With the parameter "Offset"->Automatic, a partition offset of 8.33 milliseconds is used. Use "Offset"->dur to select a partition offset of duration dur. Use "Offset"->n to select a partition offset of n samples.
  • The parameter "WindowFunction" specifies the window applied to each partition. Possible settings are:
  • None         no windowing applied to the input audio
    Automatic
    func         the window is computed using the function func
    list         the sampled window list is explicitly specified
  • With the parameter "MinimumFrequency"->Automatic, a frequency is computed as Ceiling[sr/ws], where sr is the sample rate "SampleRate" and ws is the partition length "WindowSize". Use "MinimumFrequency"->f to set the minimum frequency for the filters to f.
  • With the parameter "MaximumFrequency"->Automatic, a frequency is computed as Round[Min[8000,sr/2]], where sr is the sample rate "SampleRate". Use "MaximumFrequency"->f to set the maximum frequency for the filters to f.
  • With the parameter "NumberOfFilters"->n, n filters will be used in the computation of the MFCC.
  • With the parameter "NumberOfCoefficients"->n, n coefficients will be used in the computation of the MFCC.
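
A minimal sketch of the arithmetic behind the Automatic settings described above, at the default "SampleRate". Treating ws as a partition length in samples is an assumption, and how the encoder rounds the 8.33-millisecond offset to whole samples is not stated in the text:

```wolfram
sr = 16000;                        (* default "SampleRate" *)
ws = sr*25/1000                    (* 25 ms partition: 400 samples *)
offset = sr*8.33/1000              (* 8.33 ms offset: about 133.28 samples, before any rounding *)
minFreq = Ceiling[sr/ws]           (* with ws assumed to be in samples: 40 *)
maxFreq = Round[Min[8000, sr/2]]   (* 8000 *)
```

Under these assumptions the default mel filters would span roughly 40 Hz to 8000 Hz.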

Examples

Basic Examples  (1)

Create an MFCC NetEncoder:

In[1]:=
Out[1]=

Create an Audio object:

In[2]:=
Out[2]=

Apply the encoder to the Audio object:

In[3]:=
Out[3]//Short=

Plot the result:

In[4]:=
Out[4]=
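
The steps above can be sketched as follows. The test tone generated with AudioGenerator and the use of MatrixPlot are illustrative assumptions, not necessarily the inputs of the original example:

```wolfram
(* Create the encoder with default parameters *)
enc = NetEncoder["AudioMFCC"]

(* A one-second 440 Hz test tone; any Audio object or File[…] expression would do *)
audio = AudioGenerator[{"Sin", 440}, 1]

(* Apply the encoder; the result has dimensions {n, nc} *)
mfcc = enc[audio];
Dimensions[mfcc]

(* Plot partitions (rows) against coefficients (columns);
   Normal leaves a list unchanged and unpacks a NumericArray *)
MatrixPlot[Normal[mfcc]]
```

With the default "NumberOfCoefficients", the second dimension should be 13; the number of partitions n depends on the duration of the input.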

Scope  (3)
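
A sketch of two usages described in the Details: applying the encoder to a list of inputs and attaching it to the input port of a net. The test tones, the layers and their sizes are hypothetical choices made only for illustration:

```wolfram
enc = NetEncoder["AudioMFCC"];

(* Two test tones of different durations (illustrative inputs) *)
a1 = AudioGenerator[{"Sin", 440}, 1];
a2 = AudioGenerator[{"Sin", 880}, 2];

(* A list of inputs gives a list of outputs, one array per input *)
out = enc[{a1, a2}];
Dimensions /@ out

(* Attach the encoder to the input port of a small, hypothetical sequence classifier *)
net = NetChain[
  {LongShortTermMemoryLayer[32], SequenceLastLayer[], LinearLayer[2], SoftmaxLayer[]},
  "Input" -> enc]
```

Because "TargetLength" is All by default, the two outputs are expected to have different numbers of partitions.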

Parameters  (9)
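
A sketch of an encoder with several of the parameters listed in the Details set explicitly. The particular values are arbitrary illustrations rather than recommended settings:

```wolfram
enc = NetEncoder[{"AudioMFCC",
    "SampleRate" -> 16000,
    "WindowSize" -> 512,            (* partition length in samples *)
    "Offset" -> 256,                (* partition offset in samples *)
    "NumberOfFilters" -> 40,
    "NumberOfCoefficients" -> 20,
    "TargetLength" -> 100,          (* first 100 partitions, with padding or trimming *)
    "Normalization" -> "Max",
    "Augmentation" -> {
      "Volume" -> {0.5, 1.},        (* multiplier drawn uniformly between the bounds *)
      "Noise" -> 0.01               (* white noise with amplitude 0.01 *)
    }
  }];

(* Illustrative one-second test tone *)
audio = AudioGenerator[{"Sin", 440}, 1];
Dimensions[enc[audio]]
```

According to the descriptions of "TargetLength" and "NumberOfCoefficients", the output is expected to have 100 rows and 20 columns.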

Properties & Relations  (1)
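
A sketch of the per-frame computation described in the Details: the discrete cosine transform of the logarithm of one mel-spectrogram frame, truncated to the first nc coefficients. The random frame is a stand-in for real filter outputs, and the DCT type and scaling used internally by the encoder are assumptions (FourierDCT defaults to type 2):

```wolfram
nf = 40;   (* number of mel filters, as in the default "NumberOfFilters" *)
nc = 13;   (* number of coefficients, as in the default "NumberOfCoefficients" *)

(* Hypothetical mel-spectrogram frame with strictly positive filter outputs *)
frame = RandomReal[{0.1, 1}, nf];

(* DCT of the log-energies, keeping only the first nc coefficients *)
coefficients = Take[FourierDCT[Log[frame]], nc];
Length[coefficients]
```

Repeating this for each of the n frames gives an array of dimensions {n, nc}, matching the output shape stated in the Details.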