Audio Analysis with Neural Networks

Audio Encoding
The fundamental tool for transforming Audio objects (or audio files) into a format appropriate for neural nets is NetEncoder. The Wolfram Language natively provides several audio encoders based on different kinds of feature computations. These encoders all leverage a low-level, parallel implementation that allows for very fast computation.

Encoder Types

All the encoders share the same preprocessing steps: the appropriate part of the signal is extracted, then downmixed to a single channel and resampled to a uniform sample rate.
The available audio encoders are:
"Audio"
waveform of the signal
"AudioMelSpectrogram"
short-time Fourier transform of the signal
"AudioMFCC"
spectrogram of the signal
"AudioSpectrogram"
spectrogram with frequencies equally spaced on the mel scale
"AudioSTFT"
FourierDCT of the logarithm of the mel-spectrogram
Audio encoders.
The "Audio" encoder simply extracts the waveform of the signal. This preserves all information, but the dimensions of the result are not ideal for neural nets.
The "AudioSTFT" encoder partitions the signal and computes a Fourier transform on each partition (the whole operation is called short-time Fourier transform, or STFT). This provides both time and frequency information, and since the Fourier transform is invertible, all the information in the original signal is preserved.
The "AudioSpectrogram" encoder computes the squared magnitude of the STFT and discards the redundant part. This reduces the dimensionality of the feature, but the phase information is lost. It is still possible to compute an approximate reconstruction of the original signal (see InverseSpectrogram for details).
The "AudioMelSpectrogram" encoder applies a filterbank to the spectrogram. The center frequencies of the filterbank are spaced linearly on the mel scale, a nonlinear frequency scale that mimics the human perception of pitch. This reduces the dimensionality even further.
The "AudioMFCC" encoder computes the FourierDCT of the logarithm of the mel-spectrogram and discards some of the higher-order coefficients. This achieves a very high dimensionality reduction while preserving a lot of the important information, especially for speech signals.
Plot the result of the "AudioMelSpectrogram" encoder:
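A minimal sketch (the sample rate and the synthetic test tone are illustrative stand-ins for a real setting and recording):

encoder = NetEncoder[{"AudioMelSpectrogram", "SampleRate" -> 16000}];
testSignal = AudioGenerator[{"Sin", 440}, 2];  (* 2-second 440 Hz tone as a stand-in signal *)
MatrixPlot[Transpose[encoder[testSignal]]]     (* rows: mel bands, columns: time frames *)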

Data Augmentation

All audio encoders also share the "Augmentation" parameter. This allows them to perform data augmentation on the input, before the features (such as the spectrogram or the MFCC) are computed.
Data augmentation can be very useful when dealing with limited or reduced-size datasets, and to make a net more robust against artificial/irrelevant trends in the training data. As an example, if you were to classify recordings of cats and dogs, and in the training data all the dogs were recorded with a noisy microphone, the network might end up recognizing the noise rather than the dog.
Another convenient usage of data augmentation for audio is extracting segments from the recordings in the training data.
The available augmentations are:
"TimeShift"
shifts the input by an amount that can be randomized
"Noise"
adds noise to the input, either from a specific Audio object or file, or as white noise
"Volume"
multiplies the input by a constant that can be randomized
"Convolution"
convolves the input with an impulse response, either from a specific Audio object or a file
"VTLP"
applies vocal tract length perturbation to the input
All of the augmentations can be randomized.
Apply two augmentations at the same time:
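A sketch of an encoder that combines two augmentations (the parameter values are illustrative, not tuned):

augmentedEncoder = NetEncoder[{"AudioMelSpectrogram",
    "Augmentation" -> {"TimeShift" -> 0.1, "Volume" -> 1.5}}];
augmentedEncoder[AudioGenerator[{"Sin", 440}, 2]] // Dimensions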
Audio Classification
Train a classification network on the "Spoken Digit Commands" dataset.
The dataset consists of 10,000 recordings of the spoken digits 0 through 9 by a collection of different speakers. The speakers in the training and testing portions of the dataset do not overlap.
Gather the data:
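One possible way to fetch and reshape the data (the element names "TrainingData" and "TestData" and the "Audio", "Digit" and "SpeakerID" keys are assumptions about how the resource is organized):

rawTraining = ResourceData["Spoken Digit Commands", "TrainingData"];
rawTesting  = ResourceData["Spoken Digit Commands", "TestData"];
trainingData = #Audio -> #Digit & /@ Normal[rawTraining];
testingData  = #Audio -> #Digit & /@ Normal[rawTesting];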
Use the "AudioMFCC" encoder, since it provides significant dimensionality reduction while preserving a lot of the information present in speech signals.
Create the encoder and decoder:
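A minimal sketch, assuming 16 kHz recordings and labels given as the integers 0 through 9:

encoder = NetEncoder[{"AudioMFCC", "SampleRate" -> 16000, "NumberOfCoefficients" -> 28}];
decoder = NetDecoder[{"Class", Range[0, 9]}];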
ConvolutionLayer supports variable-length inputs. This is achieved by setting the "Interleaving" option to True. Then a very simple convolutional net can be built based on the "LeNet" architecture, which is widely used in image processing. Some adjustments in the final layers need to be made to accommodate the variable-length nature of audio data.
Create the convolutional net, train it on the dataset and measure its performance:
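A rough LeNet-style sketch (layer sizes and training options are illustrative); it uses the trainingData and testingData lists of Audio -> label rules from above:

convNet = NetChain[{
    ConvolutionLayer[20, 5, "Interleaving" -> True], Ramp,
    PoolingLayer[2, 2, "Interleaving" -> True],
    ConvolutionLayer[50, 5, "Interleaving" -> True], Ramp,
    PoolingLayer[2, 2, "Interleaving" -> True],
    AggregationLayer[Max, 1],              (* pool over the variable-length time dimension *)
    LinearLayer[500], Ramp, LinearLayer[10], SoftmaxLayer[]},
   "Input" -> encoder, "Output" -> decoder];
trainedConvNet = NetTrain[convNet, trainingData, MaxTrainingRounds -> 20];
NetMeasurements[trainedConvNet, testingData, "Accuracy"]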
Another approach is to create a fully recurrent network. The net is based on a stack of GatedRecurrentLayer, followed by a simple classification section. To add some regularization, dropout at the input of the recurrent layer is used.
Create the recurrent net, train it on the dataset and measure its performance:
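A possible sketch of the recurrent alternative (sizes and training options are illustrative):

recurrentNet = NetChain[{
    DropoutLayer[0.2],                     (* dropout at the input of the recurrent stack *)
    GatedRecurrentLayer[64], GatedRecurrentLayer[64],
    SequenceLastLayer[],
    LinearLayer[10], SoftmaxLayer[]},
   "Input" -> encoder, "Output" -> decoder];
trainedRecurrentNet = NetTrain[recurrentNet, trainingData, MaxTrainingRounds -> 20];
NetMeasurements[trainedRecurrentNet, testingData, "Accuracy"]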
By removing the last classification layers, it is possible to obtain a feature extractor for audio signals. This feature extractor is sensitive to the features that were important for the digit classification task; in other words, the learned embedding will contain information on which digit was spoken but will disregard any information about the speaker identity.
Plot the testing dataset using the embedding learned by the network:
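One way to do this with the recurrent sketch above (NetDrop removes the final LinearLayer and SoftmaxLayer):

digitFeatureExtractor = NetDrop[trainedRecurrentNet, -2];
FeatureSpacePlot[testingData, FeatureExtractor -> digitFeatureExtractor]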
Audio Event Detection
In some cases, the goal is to train a network to locate sound events in a recording, but the data is "weakly labeled": the labels only state whether a certain event is present in a recording, not where it occurs. Despite this limitation, it is possible to obtain good results in sound event localization by training on weakly labeled data.
Use the "Audio Cats and Dogs" dataset from the Wolfram Data Repository, a collection of recordings of cats and dogs.
Retrieve the dataset:
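For example (that the resource returns a list of Audio -> label rules is an assumption here):

rawData = ResourceData["Audio Cats and Dogs"];
RandomChoice[rawData]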
The duration of the signals varies between 1 and 18 seconds.
Divide the data into training and testing sets and plot a histogram of the duration of the signals:
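A sketch of a random 80/20 split and a histogram of the durations (the split ratio is illustrative):

{trainData, testData} = TakeDrop[RandomSample[rawData], Floor[0.8 Length[rawData]]];
durations = QuantityMagnitude[Duration /@ Keys[rawData], "Seconds"];
Histogram[durations, {1}]   (* 1-second bins *)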
Use the "AudioMelSpectrogram" encoder to feed the audio signal into the network. Since the amount of data is relatively small, data augmentation can be done to make the training more effective.
Create the encoder and decoder:
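A sketch with a few augmentations enabled (the augmentation values and the class labels "cat" and "dog" are assumptions):

augmentedEncoder = NetEncoder[{"AudioMelSpectrogram", "SampleRate" -> 16000,
    "Augmentation" -> {"TimeShift" -> 0.1, "Noise" -> 0.005, "Volume" -> 1.5}}];
catDogDecoder = NetDecoder[{"Class", {"cat", "dog"}}];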
The net will be based on a stack of recurrent layers (GatedRecurrentLayer), and an AggregationLayer to pool the result in the time dimension. This allows the net to output a single classification result instead of a sequence.
Create the net and train it on the dataset:
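A minimal version of such a net (sizes and training options are illustrative):

catDogNet = NetChain[{
    GatedRecurrentLayer[32], GatedRecurrentLayer[32],
    AggregationLayer[Mean, 1],             (* pool over the time dimension *)
    LinearLayer[2], SoftmaxLayer[]},
   "Input" -> augmentedEncoder, "Output" -> catDogDecoder];
trainedCatDogNet = NetTrain[catDogNet, trainData, MaxTrainingRounds -> 30];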
To use the net in a reproducible fashion, the encoder with the augmentations needs to be replaced with one without them.
Measure the performance of the net:
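For example, swapping in an encoder without augmentations via NetReplacePart:

plainEncoder = NetEncoder[{"AudioMelSpectrogram", "SampleRate" -> 16000}];
evalNet = NetReplacePart[trainedCatDogNet, "Input" -> plainEncoder];
NetMeasurements[evalNet, testData, {"Accuracy", "ConfusionMatrixPlot"}]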
By removing the AggregationLayer and reattaching the SoftmaxLayer to the chopped net, a network that returns a sequence of class probabilities instead of a single classification result is obtained.
Create the time-resolved net and define a function to easily get its results:
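One possible construction, based on the sketch above: keep the recurrent stack, map the trained LinearLayer over every time step with NetMapOperator, and reattach a SoftmaxLayer (the layer positions refer to that sketch):

timeResolvedNet = NetChain[{
    NetTake[evalNet, 2],                      (* the two GatedRecurrentLayers *)
    NetMapOperator[NetExtract[evalNet, 4]],   (* the trained LinearLayer, applied per frame *)
    SoftmaxLayer[]}];
probabilityPlot[a_Audio] := ListLinePlot[Transpose[timeResolvedNet[a]],
   PlotLegends -> {"cat", "dog"}, PlotRange -> {0, 1}]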
Test this time-resolved net on a signal that contains both cat and dog noises.
Test the net on a signal that was not in the training set:
Audio Embeddings
An embedding can be learned as a side effect of training on a classification task. It is also possible to make the learning of a meaningful embedding the explicit objective of the training by using different training tasks.

Siamese Networks with Contrastive Loss

The first strategy to learn an embedding is training a siamese network using the contrastive loss. This involves feeding two inputs to the exact same network, computing the distance between the two outputs and feeding it to the contrastive loss. If the two inputs belong to the same class, the distance will be minimized; if they do not, it will be maximized.
This will again use the "Spoken Digit Commands" dataset. This time around, ignore which digit was spoken and only pay attention to the speaker.
Gather the data and group it by the speaker ID:
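A possible way to group the recordings, reusing rawTraining from above (the "Audio" and "SpeakerID" keys are assumptions about the structure of the resource elements):

bySpeaker = GroupBy[Normal[rawTraining], #SpeakerID &, Map[#Audio &]];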
Create a set of pairs of examples. Ideally, it should contain as many positive examples (pairs of recordings from the same speaker) as negative examples (pairs of recordings from different speakers).
Create the training pairs:
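A sketch of a pair generator; it produces the three-port associations used by the training net defined below, with target 1 for same-speaker pairs and 0 otherwise (the pair counts are illustrative):

samePair[] := With[{group = RandomChoice[Values[bySpeaker]]},
   <|"Input1" -> RandomChoice[group], "Input2" -> RandomChoice[group], "Target" -> 1|>];
differentPair[] := With[{groups = RandomSample[Values[bySpeaker], 2]},
   <|"Input1" -> RandomChoice[First[groups]],
     "Input2" -> RandomChoice[Last[groups]], "Target" -> 0|>];
trainingPairs = Join[Table[samePair[], 2000], Table[differentPair[], 2000]];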
This example will use a very simple recurrent network to produce the embedding, together with a training net that feeds the two inputs to the same network and computes the distance between the resulting embeddings. Use NetInsertSharedArrays to ensure that the two inputs are processed by exactly the same network.
Create the net that will produce the embedding and the net to compute the contrastive loss:
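A minimal sketch: a small recurrent embedding net, shared between the two inputs via NetInsertSharedArrays, with the Euclidean distance between the two embeddings fed to a ContrastiveLossLayer (all sizes are illustrative):

embeddingNet = NetChain[{
    GatedRecurrentLayer[64], SequenceLastLayer[], LinearLayer[32]},
   "Input" -> NetEncoder[{"AudioMFCC", "SampleRate" -> 16000}]];
sharedEmbeddingNet = NetInsertSharedArrays[embeddingNet];

pairNet = NetGraph[<|
    "emb1" -> sharedEmbeddingNet, "emb2" -> sharedEmbeddingNet,
    "diff" -> ThreadingLayer[Subtract],
    "square" -> ElementwiseLayer[#^2 &],
    "sum" -> SummationLayer[],
    "distance" -> ElementwiseLayer[Sqrt],
    "loss" -> ContrastiveLossLayer[]|>,    (* target 1: same speaker, distance minimized *)
   {NetPort["Input1"] -> "emb1", NetPort["Input2"] -> "emb2",
    {"emb1", "emb2"} -> "diff" -> "square" -> "sum" -> "distance" -> "loss",
    NetPort["Target"] -> NetPort[{"loss", "Target"}]}]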
Once the network is trained, extract one of the equivalent subnets that computes the embedding.
Train the pair-embedding net and extract the embedding net from the trained result:
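For example (training options are illustrative):

trainedPairNet = NetTrain[pairNet, trainingPairs, LossFunction -> "Loss",
   MaxTrainingRounds -> 10];
speakerEmbedding = NetExtract[trainedPairNet, "emb1"];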
Now use the embedding to visualize collections of signals.
Visualize the test dataset in the computed embedding space:
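For instance, labeling each test recording by its speaker (again assuming "Audio" and "SpeakerID" keys, and subsampling for readability):

FeatureSpacePlot[#Audio -> #SpeakerID & /@ RandomSample[Normal[rawTesting], 200],
  FeatureExtractor -> speakerEmbedding]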
Or compare different recordings by measuring the distance between their embeddings.
Plot the distance matrix between different examples in the test dataset:
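A compact way to compare a handful of test recordings through their precomputed embeddings:

testAudio = #Audio & /@ RandomSample[Normal[rawTesting], 20];
embeddings = speakerEmbedding /@ testAudio;
MatrixPlot[DistanceMatrix[embeddings]]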
Define a NearestFunction using the learned embedding and find the closest and furthest signals from an example:
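A sketch using the precomputed embeddings above (query stands for any Audio object):

nearestSignal = Nearest[embeddings -> testAudio];
nearestSignal[speakerEmbedding[query], 3]                    (* three closest signals *)
TakeLargestBy[testAudio,
  EuclideanDistance[speakerEmbedding[#], speakerEmbedding[query]] &, 3]  (* three furthest *)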

Pre-trained Audio Feature Extractors

An alternative to learning an embedding from scratch is to leverage pre-trained audio analysis networks. The Wolfram Neural Net Repository is an excellent source.
AudioIdentify uses a deep neural net as a back end to perform audio classification. The network was trained on the AudioSet dataset, where each audio signal is annotated with the sound classes/sources that are present in the recording. The labels are organized in an ontology of about 600 classes that span a very wide domain of sound types or sources, from musical instruments and music types to animal, mechanical and human sounds.
Import the AudioIdentify net and apply it on an Audio object:
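For example (the test tone is a stand-in for a real recording; the repository model name below is assumed):

audioIdentifyNet = NetModel["AudioIdentify V1 Trained on AudioSet Data"];
audioIdentifyNet[AudioGenerator[{"Sin", 440}, 2]]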
In the AudioIdentify net, the signal is divided into fixed-size chunks, and the main net is applied to the mel-spectrogram of each of those chunks. Similarly to the architectures presented in "CNN Architectures for Large-Scale Audio Classification", the main net has a CNN architecture (based on MobileNet v2); use NetExtract to get it.
Extract the net at the core of the NetMapOperator:
The network used in AudioIdentify can be used not only for recognizing sounds but also to extract features from a recording. This allows any signal to be embedded in a semantically meaningful space, where similarities and distances can be computed.
The last few layers that are in charge of the classification task can be removed, and the resulting network can be reinserted into the original NetChain. This net will produce a fixed-size, semantically meaningful vector for each audio input. It can be used as a feature extractor for all the high-level machine learning functions in the system or as a starting point to train a new neural net.
Create a feature extractor net and use it in FeatureSpacePlot:
Another alternative is the "VGGish Feature Extractor Trained on YouTube Data" model, a structurally similar net that was trained by Google on data from YouTube specifically for audio feature extraction.
Use the VGGish network as a feature extractor in FeatureSpacePlot:
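A sketch, assuming the model returns one feature vector per chunk of the signal, which is averaged to obtain a fixed-size vector (audioExamples stands for a list of Audio objects or Audio -> label rules):

vggish = NetModel["VGGish Feature Extractor Trained on YouTube Data"];
FeatureSpacePlot[audioExamples, FeatureExtractor -> (Mean[vggish[#]] &)]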
Transfer Learning for Audio
Sometimes the amount of data available to train a network is insufficient for the task at hand. Transfer learning is a possible solution to this problem. Instead of training a network from scratch, it is possible to use as a starting point a net that has already been trained on a different but related task.
Begin by downloading the ESC-50 dataset, a labeled collection of environmental sound recordings organized in 50 classes.
Download and parse the ESC-50 dataset:
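A rough sketch of one way to fetch and parse the GitHub distribution of the dataset (the archive layout and the column names of esc50.csv follow the public repository; paths are resolved by searching the extracted files):

zip = URLDownload["https://github.com/karolpiczak/ESC-50/archive/master.zip",
   FileNameJoin[{$TemporaryDirectory, "ESC-50.zip"}]];
files = ExtractArchive[zip, $TemporaryDirectory];
metadata = SemanticImport[First@Select[files, StringEndsQ[#, "esc50.csv"] &]];
audioDir = DirectoryName@First@Select[files, StringEndsQ[#, ".wav"] &];
examples = Normal@metadata[All,
    <|"Audio" -> File[FileNameJoin[{audioDir, #filename}]],
      "Label" -> #category, "Fold" -> #fold|> &];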
Use the network from AudioIdentify as a starting point.
Construct the feature extraction network:
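A possible sketch: drop the classification head of the AudioIdentify model and keep the rest as a feature extractor. How many layers to remove depends on the structure of the particular model, so inspect the net first; the -2 below is a placeholder:

(* if the model is a NetGraph rather than a NetChain, use NetTake/NetReplacePart
   on the relevant part instead of NetDrop *)
featureNet = NetDrop[NetModel["AudioIdentify V1 Trained on AudioSet Data"], -2];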
Instead of retraining the full net (using the LearningRateMultipliers option of NetTrain to train only the classification layers), precompute the output of the feature extractor net and train the classifier on it. This avoids evaluating the full net repeatedly during training.
It is also possible to divide the dataset into the folds defined by its creator.
Precompute the features and divide the dataset according to the original folds:
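For instance, using the parsed examples from above (this evaluation can take a while):

withFeatures = Map[{#Fold, featureNet[#Audio] -> #Label} &, examples];
folds = GroupBy[withFeatures, First -> Last];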
Construct a simple recurrent classifier network that will be attached to the feature extractor and train it on each of the folds.
Define a simple recurrent classifier and train it on each of the folds:
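A sketch of a small recurrent classifier trained in a leave-one-fold-out fashion (sizes and training options are illustrative; it assumes the precomputed features form a sequence of vectors):

classes = Union[Values[Catenate[Values[folds]]]];
classifier = NetChain[{
    GatedRecurrentLayer[64], SequenceLastLayer[],
    LinearLayer[Length[classes]], SoftmaxLayer[]},
   "Output" -> NetDecoder[{"Class", classes}]];
trainedPerFold = Association@Table[
    fold -> NetTrain[classifier, Catenate[Values[KeyDrop[folds, fold]]],
       ValidationSet -> folds[fold], MaxTrainingRounds -> 30],
    {fold, Keys[folds]}];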
After training, it is easy to measure the accuracy of the classifiers on each of the folds and average them to obtain a cross-validated result.
Measure the performance on each fold and compute their average:
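For example:

foldAccuracy = Association@KeyValueMap[
    #1 -> NetMeasurements[#2, folds[#1], "Accuracy"] &, trainedPerFold];
Mean[Values[foldAccuracy]]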
As a final test, construct a net that joins the original feature extractor and an ensemble average of the trained classifiers and run it on a signal that was not in the original dataset.
Test the final net on an unrelated example:
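A sketch of a simple ensemble that averages the class probabilities of the per-fold classifiers (newRecording stands for any Audio object not in the dataset):

ensembleClassify[a_Audio] := TakeLargest[
   Mean[Table[net[featureNet[a], "Probabilities"], {net, Values[trainedPerFold]}]], 3];
ensembleClassify[newRecording]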