"AudioMFCC" (Net Encoder)
NetEncoder["AudioMFCC"]
represents an encoder that converts an audio file or object into its mel-frequency cepstral coefficients.
NetEncoder[{"AudioMFCC","param"->val,…}]
represents an encoder with specific parameters for preprocessing and feature computation.
Details
- The "AudioMFCC" encoder computes the FourierDCT of the logarithm of each frame of the mel-spectrogram and keeps only the first few coefficients. Mel-frequency cepstral coefficients (MFCC) greatly reduce the dimensionality of the feature while preserving much of the information contained in the original signal, especially in the case of speech.
- NetEncoder[…][input] applies the encoder to an input to produce a "Real32" output.
- NetEncoder[…][{input1,input2,…}] applies the encoder to a list of inputs to produce a list of outputs.
- When given a NumericArray as input, the output will be a NumericArray.
- The input to the encoder can be an Audio object or a File[…] expression.
- The output is computed by applying a discrete cosine transform to the logarithm of the mel-spectrogram, keeping only the first nc coefficients.
- The output of the encoder is a rank-2 tensor of dimensions {n,nc}, where n is the number of partitions after the preprocessing is applied and nc is the number of coefficients used for the computation.
- An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
- The following general parameters are supported:
    "Augmentation"            None         augmentation to be applied
    "Normalization"           None         whether to apply normalization
    "SampleRate"              16000        target sample rate
    "TargetLength"            All          target output length
- Additional partitioning parameters:
    "WindowSize"              Automatic    length of the partitions
    "Offset"                  Automatic    offset of the partitions
    "WindowFunction"          Automatic    window to be applied to the partitions
- Mel-spectrogram parameters:
    "MaximumFrequency"        Automatic    maximum frequency of the mel filters
    "MinimumFrequency"        Automatic    minimum frequency of the mel filters
    "NumberOfFilters"         40           number of the mel filters
- MFCC parameter:
    "NumberOfCoefficients"    13           number of coefficients
- The following settings and suboptions can be specified for each encoder parameter.
- "Normalization" can take the following settings:
    None             no normalization
    "Max"            absolute maximum value normalized to 1
    {"Max",val}      absolute maximum value normalized to val
    {"RMS",val}      RMS of input audio signal normalized to val
- "TargetLength" can take the following settings:
    All              same as input signal
    dur              the duration dur specified as a time quantity
    n                the first n partitions
- If the specified "TargetLength" does not match the length of the input signal, padding or trimming is applied accordingly.
- "Augmentation" can be specified as a list of rules with the following keys:
    "Convolution"    None    convolves an impulse response with the input
    "Noise"          None    adds noise to the input
    "TimeShift"      None    shifts the input by a specified amount
    "Volume"         None    multiplies the input by a constant
    "VTLP"           None    applies vocal tract length perturbation to the input
- Any augmentation parameter that accepts a numeric value can also be specified as a list of two numbers or a univariate distribution. In the first case, the value will be randomized according to a uniform distribution between the given bounds. In the second, the user-provided distribution will be used.
- Possible values for "Convolution" include:
    None             no augmentation
    signal           File or Audio object to be convolved with input
    {mix,signal}     signal to be convolved with input and mix parameter
- Possible values for "Noise" include:
    None             no augmentation
    amp              white noise with amplitude amp
    noise            File or Audio object containing the noise signal to be added
    {amp,noise}      noise signal with the specified amplitude
- Use "TimeShift"->t to shift the input by t seconds, padding or trimming if necessary. Use Scaled[s] to shift the input by s×dur seconds, where dur is the duration of the input signal. Use {t1,t2} or Scaled[{s1,s2}] to randomize the shift between the specified bounds.
- Use "Volume"->val to specify a constant multiplier.
- Vocal tract length perturbation (VTLP) multiplies the center values of the filter frequencies in the mel-spectrogram by a fixed amount. Use "VTLP"->val to specify the amount of the perturbation.
- With the parameter "WindowSize"->Automatic, a partition length of 25 milliseconds is used. Use "WindowSize"->dur to select a partition length of duration dur. Use "WindowSize"->n to select a partition length of n samples.
- With the parameter "Offset"->Automatic, a partition offset of 8.33 milliseconds is used. Use "Offset"->dur to select a partition offset of duration dur. Use "Offset"->n to select a partition offset of n samples.
- Parameter "WindowFunction" applies a window to each partition. Possible settings are:
    None         no windowing applied to the input audio
    Automatic
    func         the window is computed using the function func
    list         the sampled window list is explicitly specified
- With the parameter "MinimumFrequency"->Automatic, a frequency is computed as Ceiling[sr/ws], where sr is the sample rate "SampleRate" and ws is the partition length "WindowSize". Use "MinimumFrequency"->f to set the minimum frequency for the filters to f.
- With the parameter "MaximumFrequency"->Automatic, a frequency is computed as Round[Min[8000,sr/2]], where sr is the sample rate "SampleRate". Use "MaximumFrequency"->f to set the maximum frequency for the filters to f.
- With the parameter "NumberOfFilters"->n, n filters will be used in the computation of the MFCC.
- With the parameter "NumberOfCoefficients"->n, n coefficients will be used in the computation of the MFCC.
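The parameters described above can be combined in a single encoder specification. The following is a minimal sketch, not a recommended configuration: all numeric values are illustrative assumptions, and the white-noise signal simply stands in for real input.

```wolfram
(* illustrative settings only; the values are assumptions, not recommendations *)
enc = NetEncoder[{"AudioMFCC",
    "SampleRate" -> 16000,
    "WindowSize" -> 400,            (* 400 samples = 25 ms at 16 kHz *)
    "Offset" -> 160,                (* 160 samples = 10 ms at 16 kHz *)
    "NumberOfFilters" -> 40,
    "NumberOfCoefficients" -> 13,
    "Normalization" -> "Max",
    (* random gain between 0.5 and 1.5, plus white noise of amplitude 0.01 *)
    "Augmentation" -> {"Volume" -> {0.5, 1.5}, "Noise" -> 0.01}}];

(* one second of white noise as a stand-in input *)
audio = AudioGenerator["White", 1];

(* the result has dimensions {n, nc}, with nc equal to "NumberOfCoefficients" *)
Dimensions[enc[audio]]
```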
Examples
Basic Examples (1)
Scope (3)
NetEncoder["AudioMFCC"] can encode either File or Audio objects. Create an MFCC encoder:
Apply the encoder to a File object:
Apply the encoder to an in-core Audio object:
Apply the encoder to an out-of-core Audio object:
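A possible sketch of the three input forms, assuming a generated tone as the signal and a hypothetical temporary file name ("mfcc-example.wav") chosen only for illustration:

```wolfram
enc = NetEncoder["AudioMFCC"];

(* a generated 1 s tone stands in for any Audio object *)
audio = AudioGenerator[{"Sin", 440}, 1];

(* in-core Audio object *)
Dimensions[enc[audio]]

(* File input: first write the audio to a temporary file (hypothetical name) *)
path = Export[FileNameJoin[{$TemporaryDirectory, "mfcc-example.wav"}], audio];
Dimensions[enc[File[path]]]

(* out-of-core Audio object linked to the same file *)
Dimensions[enc[Audio[path]]]
```

With the default settings, the second dimension should be 13 in all three cases.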
Create a list of Audio objects:
NetEncoder["AudioMFCC"] maps across a batch of inputs:
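As a sketch of batch encoding, assuming two generated signals of different durations in place of real recordings:

```wolfram
enc = NetEncoder["AudioMFCC"];

(* two stand-in signals with different durations *)
audios = {AudioGenerator[{"Sin", 440}, 0.5], AudioGenerator["White", 1]};

(* with "TargetLength" -> All, each output keeps its own number of partitions *)
Dimensions /@ enc[audios]
```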
Create an MFCC NetEncoder:
Attach the encoder to the input of a net:
Apply the net to an Audio object:
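One way this could look, assuming a small recurrent net chosen purely for illustration (the layers and their sizes are not prescribed by the encoder):

```wolfram
enc = NetEncoder["AudioMFCC"];

(* hypothetical net: a recurrent layer over the MFCC sequence, keeping the last state *)
net = NetInitialize[
   NetChain[{GatedRecurrentLayer[10], SequenceLastLayer[]}, "Input" -> enc]];

(* the net now accepts Audio objects directly *)
net[AudioGenerator["White", 1]]
```

The net is randomly initialized here, so the output values themselves carry no meaning.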
Parameters (9)
"Normalization" (1)
Create an Audio object:
Use an encoder with "Normalization"->None to avoid any normalization:
Since the normalization is applied to the signal before the spectrogram is computed, there are no guarantees on the bounds of the result:
Use an encoder with "Normalization"->"Max" to normalize the maximum absolute value of the waveform samples to 1:
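A minimal sketch comparing the two settings, assuming a deliberately quiet generated tone as input:

```wolfram
(* a quiet stand-in signal: a 440 Hz tone attenuated by a factor of 10 *)
audio = AudioAmplify[AudioGenerator[{"Sin", 440}, 1], 0.1];

encNone = NetEncoder[{"AudioMFCC", "Normalization" -> None}];
encMax = NetEncoder[{"AudioMFCC", "Normalization" -> "Max"}];

(* range of the coefficients without and with waveform normalization *)
MinMax[Flatten[encNone[audio]]]
MinMax[Flatten[encMax[audio]]]
```

Because the normalization acts on the waveform, the two ranges differ but neither is bounded to a fixed interval.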
"SampleRate" (1)
Create an Audio object:
Using an encoder with "SampleRate"->8000 resamples the signal to 8000Hz before performing the short-time Fourier transform:
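As a sketch, assuming a generated white-noise signal as input, the default encoder can be compared with one that resamples to 8000Hz:

```wolfram
audio = AudioGenerator["White", 1];

enc16k = NetEncoder["AudioMFCC"];                         (* default "SampleRate" of 16000 *)
enc8k = NetEncoder[{"AudioMFCC", "SampleRate" -> 8000}];

Dimensions[enc16k[audio]]
Dimensions[enc8k[audio]]
```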
"TargetLength" (1)
"Offset" (1)
Create an Audio object:
The partition offset is automatically computed to be 1/3 of the partition length:
Using an encoder with "Offset"->10 returns the MFCC computed using partitions with an offset of 10 samples:
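A possible comparison, assuming a generated signal as input; the smaller offset should yield many more partitions:

```wolfram
audio = AudioGenerator["White", 1];

encDefault = NetEncoder["AudioMFCC"];                  (* "Offset" -> Automatic *)
encDense = NetEncoder[{"AudioMFCC", "Offset" -> 10}];  (* offset of 10 samples *)

(* number of partitions in each case *)
First[Dimensions[encDefault[audio]]]
First[Dimensions[encDense[audio]]]
```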
"MinimumFrequency" (1)
Create an Audio object:
The minimum frequency is automatically computed to be Ceiling[sr/ws], where sr is the sample rate "SampleRate" and ws is the partition length "WindowSize":
Using an encoder with "MinimumFrequency"->2000 returns the MFCC computed using filters whose minimum frequency is 2000Hz:
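The automatic value can be worked out by hand for the default settings, assuming that the 25 millisecond default window at a 16000Hz sample rate corresponds to 400 samples. The sketch below evaluates the formula and then builds an encoder with an explicit minimum frequency:

```wolfram
sr = 16000;             (* default "SampleRate" *)
ws = sr*25/1000;        (* 25 ms window expressed in samples: 400 *)
Ceiling[sr/ws]          (* automatic minimum frequency: 40 *)

(* explicit setting; a generated signal stands in for real input *)
enc = NetEncoder[{"AudioMFCC", "MinimumFrequency" -> 2000}];
Dimensions[enc[AudioGenerator["White", 1]]]
```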
"MaximumFrequency" (1)
"NumberOfFilters" (1)
Create an Audio object:
By default, 40 filters are used for the computation of the MFCC:
Using an encoder with "NumberOfFilters"->14 returns the MFCC computed using 14 filters:
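A sketch of this setting, assuming a generated signal as input. The number of filters changes the underlying mel-spectrogram, not the output dimensions, which remain governed by "NumberOfCoefficients":

```wolfram
audio = AudioGenerator["White", 1];

enc14 = NetEncoder[{"AudioMFCC", "NumberOfFilters" -> 14}];

(* still 13 coefficients per partition with the default "NumberOfCoefficients" *)
Dimensions[enc14[audio]]
```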
"NumberOfCoefficients" (1)
Create an Audio object:
By default, 13 coefficients are used for the computation of the MFCC:
Using an encoder with "NumberOfCoefficients"->40 returns the MFCC computed using 40 coefficients:
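As a sketch, assuming a generated signal as input, the second dimension of the output follows the "NumberOfCoefficients" setting:

```wolfram
audio = AudioGenerator["White", 1];

enc13 = NetEncoder["AudioMFCC"];                                   (* default: 13 coefficients *)
enc40 = NetEncoder[{"AudioMFCC", "NumberOfCoefficients" -> 40}];

Last[Dimensions[enc13[audio]]]
Last[Dimensions[enc40[audio]]]
```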
Properties & Relations (1)
Create an Audio object:
Create an MFCC NetEncoder:
The length of the result can be computed as Ceiling[length/offset], where length is the length of the signal after resampling and offset is the "Offset" parameter of the encoder:
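The relation can be checked with a sketch like the following. It assumes a signal generated directly at the encoder's sample rate, so that no resampling changes its length, and an explicit "Offset" of 160 samples chosen only for illustration:

```wolfram
sr = 16000;
offset = 160;   (* illustrative value *)

(* one second of noise generated at the encoder's sample rate *)
audio = AudioGenerator["White", 1, SampleRate -> sr];
enc = NetEncoder[{"AudioMFCC", "SampleRate" -> sr, "Offset" -> offset}];

(* number of samples in the first channel *)
length = Length[First[AudioData[audio]]];

(* predicted and observed number of partitions *)
Ceiling[length/offset]
First[Dimensions[enc[audio]]]
```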