Audio Analysis with Neural Networks
Audio Encoding | Audio Embeddings | Audio Classification | Transfer Learning for Audio | Audio Event Detection
The fundamental tool for transforming Audio objects (or audio files) into a format appropriate for neural nets is the NetEncoder. The Wolfram Language natively provides several audio encoders that are based on different kinds of feature computations. These encoders all leverage a low-level, parallel implementation that allows for very fast computation.
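As a minimal sketch, an encoder can be created and applied directly to an Audio object; here an ExampleData recording is used only to have a sample signal at hand:

(* create an "AudioMFCC" encoder and apply it directly to an Audio object *)
enc = NetEncoder["AudioMFCC"];
a = ExampleData[{"Audio", "AltoFlute"}];
Dimensions[enc[a]]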
Encoder Types
All the encoders share the same preprocessing steps. The first is the extraction of the appropriate part of the signal, followed by downmixing to a single channel and resampling to a uniform frequency.
"Audio" | waveform of the signal |
"AudioMelSpectrogram" | short-time Fourier transform of the signal |
"AudioMFCC" | spectrogram of the signal |
"AudioSpectrogram" | spectrogram with frequencies equally spaced on the mel scale |
"AudioSTFT" | FourierDCT of the logarithm of the mel-spectrogram |
The "Audio" encoder simply extracts the waveform of the signal. This preserves all information, but the dimensions of the result are not ideal for neural nets.
The "AudioSTFT" encoder partitions the signal and computes a Fourier transform on each partition (the whole operation is called short-time Fourier transform, or STFT). This provides both time and frequency information, and since the Fourier transform is invertible, all the information in the original signal is preserved.
The "AudioSpectrogram" encoder computes the squared magnitude of the STFT and discards the redundant part. This reduces the dimensionality of the feature, but the phase information is lost. It is still possible to compute an approximate reconstruction of the original signal (see InverseSpectrogram for details).
The "AudioMelSpectrogram" encoder applies a filterbank to the spectrogram. The center frequencies of the filterbank are spaced linearly on the mel scale, a nonlinear frequency scale that mimics the human perception of pitch. This reduces the dimensionality even further.
The "AudioMFCC" encoder computes the FourierDCT of the logarithm of the mel-spectrogram and discards some of the higher-order coefficients. This achieves a very high dimensionality reduction while preserving a lot of the important information, especially for speech signals.
Plot the result of the "AudioMelSpectrogram" encoder:
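A possible version of the corresponding input, with an illustrative number of filters and an ExampleData recording standing in for the original signal:

(* mel-spectrogram encoder; the number of filters is illustrative *)
melEnc = NetEncoder[{"AudioMelSpectrogram", "NumberOfFilters" -> 64}];
a = ExampleData[{"Audio", "AltoFlute"}];

(* rows are mel filters, columns are time frames *)
MatrixPlot[Transpose[melEnc[a]], FrameTicks -> None]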
Data Augmentation
All audio encoders also share the "Augmentation" parameter. This allows them to perform data augmentation on the input, before the features (such as the spectrogram or the MFCC) are computed.
Data augmentation can be very useful when dealing with limited or reduced-size datasets, and to make a net more robust against artificial/irrelevant trends in the training data. As an example, if you were to classify recordings of cats and dogs, and in the training data all the dogs were recorded with a noisy microphone, the network might end up recognizing the noise rather than the dog.
Another convenient usage of data augmentation for audio is extracting segments from the recordings in the training data.
"TimeShift" | shifts the input by an amount that can be randomized | |
"Noise" | adds noise to the input, either from a specific piece of Audio or file or as white noise | |
"Volume" | multiplies the input with a constant that can be randomized | |
"Convolution" | convolves an impulse response to the input, either from a specific piece of Audio or a file | |
"VTLP" | applies vocal tract length perturbation to the input |
The dataset consists of 10,000 recordings of spoken digits from 0 to 9 by a collection of different speakers. The speakers in the training and testing portions of the dataset do not overlap.
Use the "AudioMFCC" encoder, since it provides significant dimensionality reduction while preserving a lot of the information present in speech signals.
ConvolutionLayer supports variable-length inputs when the "Interleaving" option is set to True. This makes it possible to build a very simple convolutional net based on the "LeNet" architecture, which is widely used in image processing, with some adjustments in the final layers to accommodate the variable-length nature of audio data.
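A minimal sketch of such a net; the layer sizes, the max pooling over the variable time dimension and the class labels are illustrative choices rather than an exact "LeNet" transcription:

(* simple variable-length convolutional classifier *)
convNet = NetChain[{
    ConvolutionLayer[16, 5, "Interleaving" -> True], Ramp,
    PoolingLayer[2, 2, "Interleaving" -> True],
    ConvolutionLayer[32, 5, "Interleaving" -> True], Ramp,
    PoolingLayer[2, 2, "Interleaving" -> True],
    AggregationLayer[Max, 1],             (* pool over the variable time dimension *)
    LinearLayer[64], Ramp,
    LinearLayer[10], SoftmaxLayer[]},
   "Input" -> mfccEnc,                     (* mfccEnc as defined above *)
   "Output" -> NetDecoder[{"Class", Range[0, 9]}]]  (* labels assumed to be the digits 0-9 *)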
Another approach is to create a fully recurrent network. The net is based on a stack of GatedRecurrentLayer, followed by a simple classification section. To add some regularization, dropout is applied at the input of the recurrent layers.
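A possible recurrent counterpart, again with illustrative sizes; the "Dropout" specification is assumed to target the input of each recurrent layer:

(* stacked GatedRecurrentLayer classifier with input dropout *)
rnnNet = NetChain[{
    GatedRecurrentLayer[64, "Dropout" -> {"VariationalInput" -> 0.2}],
    GatedRecurrentLayer[64, "Dropout" -> {"VariationalInput" -> 0.2}],
    SequenceLastLayer[],
    LinearLayer[10], SoftmaxLayer[]},
   "Input" -> mfccEnc,
   "Output" -> NetDecoder[{"Class", Range[0, 9]}]];

(* training; train is assumed to be a list of Audio -> label examples *)
trainedRNN = NetTrain[rnnNet, train, ValidationSet -> test]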
By removing the last classification layers, it is possible to obtain a feature extractor for audio signals. This extractor emphasizes the features that were important for the digit classification task; in other words, the learned embedding will contain information about which digit was spoken but will disregard information about the speaker's identity.
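For instance, dropping the last two layers of the trained recurrent classifier sketched above leaves the 64-dimensional output of SequenceLastLayer as the embedding:

(* remove the classification section (LinearLayer and SoftmaxLayer) *)
digitEmbedder = NetDrop[trainedRNN, -2];

(* visualize some test recordings in the learned space; test is assumed to be a list of Audio -> label rules *)
FeatureSpacePlot[RandomSample[Keys[test], 200], FeatureExtractor -> digitEmbedder]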
In some cases, the goal is to train a network to locate sound events in a recording, but the data is "weakly labeled": the labels only state whether a certain event is present in a recording, not where. Despite this limitation, it is possible to obtain good results in sound event localization through training on weakly labeled data.
Use the "Audio Cats and Dogs" dataset from the Wolfram Data Repository, a collection of recordings of cats and dogs.
Use the "AudioMelSpectrogram" encoder to feed the audio signal into the network. Since the amount of data is relatively small, data augmentation can be done to make the training more effective.
The net will be based on a stack of recurrent layers (GatedRecurrentLayer), and an AggregationLayer to pool the result in the time dimension. This allows the net to output a single classification result instead of a sequence.
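A sketch of such a net, combining an augmented mel-spectrogram encoder with stacked recurrent layers and pooling over time; the augmentation values, layer sizes and class labels are illustrative:

(* encoder with illustrative augmentation values *)
augMelEnc = NetEncoder[{"AudioMelSpectrogram",
   "Augmentation" -> {"TimeShift" -> 0.2, "Volume" -> 1.5}}];

(* recurrent event-detection net *)
eventNet = NetChain[{
    GatedRecurrentLayer[32],
    GatedRecurrentLayer[32],
    NetMapOperator[LinearLayer[2]],   (* per-timestep class scores *)
    AggregationLayer[Max, 1],         (* pool the scores over time *)
    SoftmaxLayer[]},
   "Input" -> augMelEnc,
   "Output" -> NetDecoder[{"Class", {"cat", "dog"}}]];

(* catsDogsExamples is a hypothetical list of Audio -> label rules derived from the dataset *)
trainedEventNet = NetTrain[eventNet, catsDogsExamples]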
To use the net in a reproducible fashion, the encoder with the augmentations needs to be replaced with one without them.
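Assuming trainedEventNet is the trained net from the sketch above:

(* swap the augmented encoder for a deterministic one *)
evalNet = NetReplacePart[trainedEventNet, "Input" -> NetEncoder["AudioMelSpectrogram"]]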
By removing the AggregationLayer and reattaching the SoftmaxLayer to the chopped net, a network that returns a sequence of class probabilities instead of a single classification result is obtained.
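Following the layer ordering of the sketch above:

(* drop the time pooling and the softmax, then reattach a SoftmaxLayer to the sequence output *)
sequenceNet = NetAppend[NetDrop[evalNet, -2], SoftmaxLayer[]]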
An embedding can be learned as a side effect of training on a classification task. It is possible to use different training tasks to make the learning of a meaningful embedding the objective of the training.
Siamese Networks with Contrastive Loss
The first strategy to learn an embedding is training a siamese network using the contrastive loss. This involves feeding two inputs to the exact same network, computing the distance between the two outputs and feeding it to the contrastive loss. If the two inputs belong to the same class, the distance will be minimized; if they do not, it will be pushed apart, up to a margin.
This will again use the "Spoken Digit Commands" dataset. This time around, ignore which digit was spoken and only pay attention to the speaker.
Create a set of pairs of examples. Ideally, it should contain as many positive pairs (recordings from the same speaker) as negative pairs (recordings from different speakers).
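A sketch of the pair construction, where speakerRecordings is a hypothetical association from speaker ID to that speaker's recordings; the convention that a target of 1 marks a same-speaker pair is an assumption:

(* positive pair: two recordings from the same speaker *)
positivePair[] := Module[{recs},
   recs = RandomSample[RandomChoice[Values[speakerRecordings]], 2];
   <|"Input1" -> recs[[1]], "Input2" -> recs[[2]], "Target" -> 1|>];

(* negative pair: recordings from two different speakers *)
negativePair[] := Module[{spks},
   spks = RandomSample[Keys[speakerRecordings], 2];
   <|"Input1" -> RandomChoice[speakerRecordings[spks[[1]]]],
     "Input2" -> RandomChoice[speakerRecordings[spks[[2]]]],
     "Target" -> 0|>];

pairs = RandomSample[Join[Table[positivePair[], 2000], Table[negativePair[], 2000]]];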
This example will use a very simple recurrent network to produce the embedding. A second net is needed to train it: one that feeds the two inputs to the same network and computes the distance between the resulting embeddings. Use NetInsertSharedArrays to ensure that the two inputs are processed by exactly the same network.
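A possible training net. The embedding architecture and sizes are illustrative, and it is assumed that ContrastiveLossLayer receives the distance on its "Input" port and a binary label on its "Target" port, with 1 marking a same-speaker pair:

(* small embedding net *)
embNet = NetChain[{
    GatedRecurrentLayer[64],
    SequenceLastLayer[],
    LinearLayer[32]},
   "Input" -> NetEncoder["AudioMFCC"]];

(* mark the arrays as shared so that both branches use the same weights *)
sharedEmb = NetInsertSharedArrays[embNet];

(* siamese net: Euclidean distance between the two embeddings feeds the contrastive loss *)
siameseNet = NetGraph[
  <|"emb1" -> sharedEmb, "emb2" -> sharedEmb,
    "diff" -> ThreadingLayer[#1 - #2 &], "sq" -> ElementwiseLayer[#^2 &],
    "sum" -> SummationLayer[], "dist" -> ElementwiseLayer[Sqrt],
    "loss" -> ContrastiveLossLayer[]|>,
  {NetPort["Input1"] -> "emb1", NetPort["Input2"] -> "emb2",
   {"emb1", "emb2"} -> "diff" -> "sq" -> "sum" -> "dist" -> NetPort[{"loss", "Input"}],
   NetPort["Target"] -> NetPort[{"loss", "Target"}]}];

(* train on the pairs built above and extract one branch as the embedding net *)
trainingData = <|"Input1" -> pairs[[All, "Input1"]],
   "Input2" -> pairs[[All, "Input2"]], "Target" -> pairs[[All, "Target"]]|>;
trainedSiamese = NetTrain[siameseNet, trainingData, LossFunction -> "Loss"];
speakerEmbedder = NetExtract[trainedSiamese, "emb1"];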
Define a NearestFunction using the learned embedding and find the closest and furthest signals from an example:
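Using the speakerEmbedder extracted above and a hypothetical list testAudios of test recordings:

(* index the test recordings by their embeddings *)
embeddings = speakerEmbedder[testAudios];      (* batch evaluation *)
nf = Nearest[embeddings -> testAudios];

(* closest recordings to a query, measured in the embedding space *)
query = First[testAudios];
nf[speakerEmbedder[query], 4]

(* furthest recordings from the same query *)
MaximalBy[testAudios, EuclideanDistance[speakerEmbedder[#], speakerEmbedder[query]] &, 4]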
Pre-trained Audio Feature Extractors
An alternative to learning an embedding from scratch is to leverage pre-trained audio analysis networks. The Wolfram Neural Net Repository is an excellent source.
AudioIdentify uses a deep neural net as a back end to perform audio classification. The network was trained on the AudioSet dataset, where each audio signal is annotated with the sound classes/sources that are present in the recording. The labels are organized in an ontology of about 600 classes that span a very wide domain of sound types or sources, from musical instruments and music types to animal, mechanical and human sounds.
Import the AudioIdentify net and apply it on an Audio object:
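The repository name below is an assumption about how the model is titled; an ExampleData recording is used as input:

(* import the net behind AudioIdentify and evaluate it on a recording *)
audioIdentifyNet = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];
a = ExampleData[{"Audio", "AltoFlute"}];
audioIdentifyNet[a]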
In the AudioIdentify net, the signal is divided into fixed-size chunks, and the main net is applied to the mel-spectrogram of each of those chunks; NetExtract can be used to access it. Similarly to the architectures presented in "CNN Architectures for Large-Scale Audio Classification", the main net has a CNN architecture (based on MobileNet v2).
Extract the net at the core of the NetMapOperator:
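A hedged sketch; the layer name "map" is hypothetical and should be read off the net's displayed structure:

(* extract the NetMapOperator (hypothetical layer name), then the net it maps over the chunks *)
mapOp = NetExtract[audioIdentifyNet, "map"];
coreNet = NetExtract[mapOp, "Net"]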
The network used in AudioIdentify can be used not only for recognizing sounds but also to extract features from a recording. This allows any signal to be embedded in a semantically meaningful space, where similarities and distances can be computed.
The last few layers that are in charge of the classification task can be removed, and the resulting network can be reinserted into the original NetChain. This net will produce a fixed-size, semantically meaningful vector for each audio input. It can be used as a feature extractor for all the high-level machine learning functions in the system or as a starting point to train a new neural net.
Create a feature extractor net and use it in FeatureSpacePlot:
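A hedged construction along those lines, assuming the top level of the net is a NetChain whose chunk-wise CNN sits in a layer named "map"; the number of dropped layers and the mean pooling over chunks are illustrative choices:

(* drop the classification section of the core CNN *)
truncatedCore = NetDrop[coreNet, -2];

(* rebuild the chunked pipeline with the truncated core and average the per-chunk embeddings *)
featureNet = NetChain[{
    NetReplacePart[NetTake[audioIdentifyNet, "map"],
      "map" -> NetMapOperator[truncatedCore]],
    AggregationLayer[Mean, 1]}];

(* audioExamples is a hypothetical list of Audio objects *)
FeatureSpacePlot[audioExamples, FeatureExtractor -> featureNet]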
Another alternative is the "VGGish Feature Extractor Trained on YouTube Data" model, a structurally similar net that was trained by Google on data from YouTube specifically for audio feature extraction.
Use the VGGish network as a feature extractor in FeatureSpacePlot:
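A sketch, assuming the VGGish net is a NetChain that outputs one embedding per chunk, so the chunk embeddings are averaged before plotting:

vggish = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* pool the per-chunk embeddings into one vector per recording *)
vggishExtractor = NetAppend[vggish, AggregationLayer[Mean, 1]];
FeatureSpacePlot[audioExamples, FeatureExtractor -> vggishExtractor]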
Sometimes the amount of data available to train a network is insufficient for the task at hand. Transfer learning is a possible solution to this problem. Instead of training a network from scratch, it is possible to use as a starting point a net that has already been trained on a different but related task.
Begin by downloading the ESC-50 dataset.
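Assuming the dataset is available in the Wolfram Data Repository under this name:

esc50 = ResourceData["ESC-50"];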
Use the network from AudioIdentify as a starting point.
Instead of retraining the full net with a LearningRateMultipliers option in NetTrain so that only the classification layers are updated, precompute the results of the feature extractor net once and train only the classifier on them. This avoids redundant evaluation of the full net during training.
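A sketch of the precomputation, reusing the names from the feature extraction sketch above and assuming foldExamples is a list of Audio -> label rules for one training fold; the final pooling is omitted here so that each recording maps to a sequence of per-chunk embeddings:

(* feature extractor without the final pooling *)
chunkFeatureNet = NetReplacePart[NetTake[audioIdentifyNet, "map"],
   "map" -> NetMapOperator[truncatedCore]];

(* precompute the features once for the fold *)
foldFeatures = Thread[chunkFeatureNet[Keys[foldExamples]] -> Values[foldExamples]];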
Construct a simple recurrent classifier network that will be attached to the feature extractor and train it on each of the folds.
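A possible classifier over the precomputed sequences; the sizes are illustrative, and classLabels stands for the list of the 50 ESC-50 categories:

(* small recurrent classifier trained on the precomputed chunk embeddings *)
classifierNet = NetChain[{
    GatedRecurrentLayer[96],
    SequenceLastLayer[],
    LinearLayer[Length[classLabels]],
    SoftmaxLayer[]},
   "Output" -> NetDecoder[{"Class", classLabels}]];

trainedClassifier = NetTrain[classifierNet, foldFeatures]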
After training, it is easy to measure the accuracy of the classifiers on each of the folds and average them to obtain a cross-validated result.
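A sketch of the measurement, assuming trainedClassifiers and testFolds are hypothetical associations from fold index to the trained net and to that fold's precomputed feature -> label rules:

(* per-fold accuracy: fraction of test examples whose predicted class matches the label *)
foldAccuracy[net_, examples_] :=
  N@Mean[Boole[MapThread[SameQ, {net[Keys[examples]], Values[examples]}]]];

accuracies = Table[foldAccuracy[trainedClassifiers[k], testFolds[k]], {k, Keys[testFolds]}];
Mean[accuracies]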