Audio Analysis with Neural Networks
Audio Encoding | Audio Embeddings | Audio Classification | Transfer Learning for Audio | Audio Event Detection
Audio Encoding
The fundamental tool for transforming Audio objects (or audio files) into a format appropriate for neural nets is NetEncoder. The Wolfram Language natively provides several audio encoders based on different kinds of feature computations. These encoders all leverage a low-level, parallel implementation that allows for very fast computation.
Encoder Types
All the encoders share the same preprocessing steps. The first is the extraction of the appropriate part of the signal, followed by downmixing to a single channel and resampling to a uniform frequency.
"Audio" | waveform of the signal |
"AudioMelSpectrogram" | short-time Fourier transform of the signal |
"AudioMFCC" | spectrogram of the signal |
"AudioSpectrogram" | spectrogram with frequencies equally spaced on the mel scale |
"AudioSTFT" | FourierDCT of the logarithm of the mel-spectrogram |
The "Audio" encoder simply extracts the waveform of the signal. This preserves all information, but the dimensions of the result are not ideal for neural nets.
The "AudioSTFT" encoder partitions the signal and computes a Fourier transform on each partition (the whole operation is called short-time Fourier transform, or STFT). This provides both time and frequency information, and since the Fourier transform is invertible, all the information in the original signal is preserved.
The "AudioSpectrogram" encoder computes the squared magnitude of the STFT and discards the redundant part. This reduces the dimensionality of the feature, but the phase information is lost. It is still possible to compute an approximate reconstruction of the original signal (see InverseSpectrogram for details).
The "AudioMelSpectrogram" encoder applies a filterbank to the spectrogram. The center frequencies of the filterbank are spaced linearly on the mel scale, a nonlinear frequency scale that mimics the human perception of pitch. This reduces the dimensionality even further.
The "AudioMFCC" encoder computes the FourierDCT of the logarithm of the mel-spectrogram and discards some of the higher-order coefficients. This achieves a very high dimensionality reduction while preserving a lot of the important information, especially for speech signals.
Plot the result of the "AudioMelSpectrogram" encoder:
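A minimal sketch of such a computation (the sine-wave test signal and the plot styling are illustrative; any Audio object or audio file would work):

encoder = NetEncoder["AudioMelSpectrogram"];

(* a test signal: one second of a 440 Hz sine wave *)
audio = AudioGenerator[{"Sin", 440}, 1];

(* the encoder returns a matrix with one row per analysis window; visualize it *)
MatrixPlot[Transpose[encoder[audio]], FrameTicks -> None]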
Data Augmentation
All audio encoders also share the "Augmentation" parameter. This allows them to perform data augmentation on the input before the features (such as the spectrogram or the MFCC) are computed. The available augmentations are listed in the table below, followed by a small example.
Data augmentation can be very useful when dealing with limited or reduced-size datasets, and to make a net more robust against spurious or irrelevant trends in the training data. As an example, if you were to classify recordings of cats and dogs, and in the training data all the dogs were recorded with a noisy microphone, the network might end up recognizing the noise rather than the dog.
Another convenient use of data augmentation for audio is extracting segments from the recordings in the training data.
"TimeShift" | shifts the input by an amount that can be randomized | |
"Noise" | adds noise to the input, either from a specific piece of Audio or file or as white noise | |
"Volume" | multiplies the input with a constant that can be randomized | |
"Convolution" | convolves an impulse response to the input, either from a specific piece of Audio or a file | |
"VTLP" | applies vocal tract length perturbation to the input |
Audio Classification
The "Spoken Digit Commands" dataset from the Wolfram Data Repository consists of 10,000 recordings of spoken digits from 0 to 9, by a collection of different speakers. The speakers of the training and testing portions of the dataset do not overlap.
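A possible way to obtain the data (the content element names "TrainingData" and "TestData" are assumptions; check the resource page for the actual elements):

ro = ResourceObject["Spoken Digit Commands"];
trainingData = ResourceData[ro, "TrainingData"];  (* assumed to be a list of Audio -> digit rules *)
testData = ResourceData[ro, "TestData"];
RandomSample[trainingData, 3]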
Use the "AudioMFCC" encoder, since it provides significant dimensionality reduction while preserving a lot of the information present in speech signals.
ConvolutionLayer supports variable-length inputs when the "Interleaving" option is set to True. A very simple convolutional net can then be built, based on the "LeNet" architecture that is widely used in image processing; the final layers need some adjustments to accommodate the variable-length nature of audio data.
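A minimal sketch of such a net (layer sizes are illustrative, and the class labels are assumed to be the digits 0 through 9):

convNet = NetChain[{
    ConvolutionLayer[64, 5, "Interleaving" -> True], Ramp,
    PoolingLayer[2, 2, "Interleaving" -> True],
    ConvolutionLayer[64, 5, "Interleaving" -> True], Ramp,
    AggregationLayer[Max, 1],  (* pool over the variable-length time dimension *)
    LinearLayer[128], Ramp,
    LinearLayer[10], SoftmaxLayer[]},
   "Input" -> NetEncoder[{"AudioMFCC"}],
   "Output" -> NetDecoder[{"Class", Range[0, 9]}]]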
Another approach is to create a fully recurrent network. The net is based on a stack of GatedRecurrentLayer objects, followed by a simple classification section. To add some regularization, dropout is applied at the input of the recurrent layers.
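A minimal sketch (the layer sizes, dropout probability, class labels and number of training rounds are assumptions):

rnn = NetChain[{
    GatedRecurrentLayer[64, "Dropout" -> {"VariationalInput" -> 0.2}],
    GatedRecurrentLayer[64, "Dropout" -> {"VariationalInput" -> 0.2}],
    SequenceLastLayer[],  (* keep only the state after the last time step *)
    LinearLayer[10], SoftmaxLayer[]},
   "Input" -> NetEncoder[{"AudioMFCC"}],
   "Output" -> NetDecoder[{"Class", Range[0, 9]}]];

trainedDigits = NetTrain[rnn, trainingData, ValidationSet -> testData, MaxTrainingRounds -> 20]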
By removing the last classification layers, it is possible to obtain a feature extractor for audio signals. This feature extractor will focus on the features that were important for the digit classification task; in other words, the learned embedding will contain information about which digit was spoken but will disregard information about the speaker's identity.
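A possible sketch, assuming trainedDigits is the net trained above:

(* drop the final LinearLayer and SoftmaxLayer *)
digitFeatureExtractor = NetDrop[trainedDigits, -2];

(* embed a few test signals and visualize them in two dimensions *)
FeatureSpacePlot[RandomSample[Keys[testData], 100], FeatureExtractor -> digitFeatureExtractor]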
Audio Event Detection
In some cases, the goal is to train a network to locate sound events in a recording, but the data is "weakly labeled": the labels only state whether a certain event is present in a recording, not where it occurs. Despite this limitation, it is possible to obtain good results in sound event localization by training on weakly labeled data.
Use the "Audio Cats and Dogs" dataset from the Wolfram Data Repository, a collection of recordings of cats and dogs.
Use the "AudioMelSpectrogram" encoder to feed the audio signal into the network. Since the amount of data is relatively small, data augmentation can be done to make the training more effective.
The net will be based on a stack of recurrent layers (GatedRecurrentLayer) followed by an AggregationLayer to pool the result over the time dimension. This allows the net to output a single classification result instead of a sequence.
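A minimal sketch (layer sizes and the class labels "cat" and "dog" are assumptions about the dataset):

catDogNet = NetChain[{
    GatedRecurrentLayer[32],
    GatedRecurrentLayer[32],
    AggregationLayer[Mean, 1],  (* pool the sequence over the time dimension *)
    LinearLayer[2], SoftmaxLayer[]},
   "Input" -> augEncoder,
   "Output" -> NetDecoder[{"Class", {"cat", "dog"}}]];

trainedCatDog = NetTrain[catDogNet, catsDogsTrain, ValidationSet -> catsDogsTest]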
To use the net in a reproducible fashion, the encoder with the augmentations needs to be replaced with one without them.
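A minimal sketch using NetReplacePart:

(* swap in a plain encoder so that evaluation is deterministic *)
evalNet = NetReplacePart[trainedCatDog, "Input" -> NetEncoder["AudioMelSpectrogram"]]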
By removing the AggregationLayer and reattaching the SoftmaxLayer to the chopped net, you obtain a network that returns a sequence of class probabilities instead of a single classification result.
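A possible sketch, reusing the trained layers of the net defined above (the layer positions refer to that sketch):

(* keep the trained recurrent layers, map the trained linear layer over each time step,
   and renormalize with a softmax to get one probability vector per step *)
timeNet = NetChain[{
    NetExtract[evalNet, 1], NetExtract[evalNet, 2],
    NetMapOperator[NetExtract[evalNet, 4]],
    SoftmaxLayer[]},
   "Input" -> NetEncoder["AudioMelSpectrogram"]];

(* class probabilities over time for a test recording *)
probs = timeNet[First[Keys[catsDogsTest]]];
ListLinePlot[Transpose[probs], PlotLegends -> {"cat", "dog"}]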
Audio Embeddings
An embedding can be learned as a side effect of training on a classification task. It is also possible to choose training tasks that make learning a meaningful embedding the explicit objective.
Siamese Networks with Contrastive Loss
The first strategy to learn an embedding is training a siamese network using the contrastive loss. This involves feeding two inputs to the exact same network, computing the distance between the two outputs and feeding it to the contrastive loss. If the two inputs belong to the same class, the distance will be minimized; if they do not, it will be maximized.
This will again use the "Spoken Digit Commands" dataset. This time around, ignore which digit was spoken and only pay attention to the speaker.
Create a set of pairs of examples. Ideally, it should contain as many positive pairs (recordings from the same speaker) as negative pairs (recordings from different speakers).
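A possible way to build such pairs, assuming examples is a list of audio -> speaker rules (the variable names and the pair counts are illustrative):

(* group the recordings by speaker *)
bySpeaker = GroupBy[examples, Last -> First];
samePair[] := Append[RandomSample[RandomChoice[Values[bySpeaker]], 2], 1];
diffPair[] := Append[RandomChoice /@ RandomSample[Values[bySpeaker], 2], 0];
raw = Join[Table[samePair[], 2000], Table[diffPair[], 2000]];

(* NetTrain-ready association: two audio inputs and a 0/1 target for each pair *)
pairs = <|"Input1" -> raw[[All, 1]], "Input2" -> raw[[All, 2]], "Target" -> raw[[All, 3]]|>;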
This example uses a very simple recurrent network to produce the embedding. A training net is then defined that feeds the two inputs to the same network and computes the distance between the resulting embeddings. NetInsertSharedArrays ensures that the two inputs are processed by exactly the same network.
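A minimal sketch of the embedder, the contrastive training net and the training step (layer sizes, the margin of 1 and the details of the loss formulation are assumptions):

embedder = NetChain[{GatedRecurrentLayer[64], SequenceLastLayer[], LinearLayer[32]},
   "Input" -> NetEncoder[{"AudioMFCC"}]];
sharedEmbedder = NetInsertSharedArrays[embedder];  (* the two copies below share all weights *)

contrastiveNet = NetGraph[<|
    "embed1" -> sharedEmbedder,
    "embed2" -> sharedEmbedder,
    "distance" -> {ThreadingLayer[Subtract], ElementwiseLayer[#^2 &],
      SummationLayer[], ElementwiseLayer[Sqrt[# + 1.*^-6] &]},
    (* contrastive loss: pull same-speaker pairs together, push different ones beyond the margin *)
    "loss" -> ThreadingLayer[Function[{d, t}, t*d^2 + (1 - t)*Ramp[1. - d]^2]]|>,
   {NetPort["Input1"] -> "embed1", NetPort["Input2"] -> "embed2",
    {"embed1", "embed2"} -> "distance",
    {"distance", NetPort["Target"]} -> "loss" -> NetPort["Loss"]},
   "Target" -> "Real"];

trainedContrastive = NetTrain[contrastiveNet, pairs, LossFunction -> "Loss"];

(* recover the embedder with its trained weights *)
speakerEmbedder = NetExtract[trainedContrastive, "embed1"]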
Define a NearestFunction using the learned embedding and find the signals closest to and furthest from an example:
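A possible sketch, assuming speakerEmbedder and examples from the previous steps:

(* precompute embeddings for a pool of recordings *)
pool = RandomSample[Keys[examples], 500];
embeddings = speakerEmbedder /@ pool;
nearest = Nearest[embeddings -> pool];

(* the recordings whose embeddings are closest to that of a query signal *)
query = First[pool];
nearest[speakerEmbedder[query], 5]

(* the furthest signal, by embedding distance *)
pool[[First[Ordering[EuclideanDistance[speakerEmbedder[query], #] & /@ embeddings, -1]]]]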
Pre-trained Audio Feature Extractors
An alternative to learning an embedding from scratch is to leverage pre-trained audio analysis networks. The Wolfram Neural Net Repository is an excellent source.
AudioIdentify uses a deep neural net as a back end to perform audio classification. The network was trained on the AudioSet dataset, where each audio signal is annotated with the sound classes/sources that are present in the recording. The labels are organized in an ontology of about 600 classes that span a very wide domain of sound types or sources, from musical instruments and music types to animal, mechanical and human sounds.
Import the AudioIdentify net and apply it to an Audio object:
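A minimal sketch (the exact name of the model in the Wolfram Neural Net Repository is an assumption, and the sine-wave test signal is a placeholder for a real recording):

(* the built-in function uses the net behind the scenes *)
AudioIdentify[AudioGenerator[{"Sin", 440}, 2]]

(* the underlying net, fetched from the Neural Net Repository *)
audioNet = NetModel["AudioIdentify V1 Trained on AudioSet Data"];
audioNet[AudioGenerator[{"Sin", 440}, 2]]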
In the AudioIdentify net, the signal is divided into fixed-size chunks, and the main net is applied to the mel-spectrogram of each of those chunks. Similarly to the architectures presented in "CNN Architectures for Large-Scale Audio Classification", the main net has a CNN architecture (based on MobileNet v2). NetExtract can be used to get it.
Extract the net at the core of the NetMapOperator:
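A possible sketch; the position of the NetMapOperator inside the chain is an assumption, so inspect audioNet to locate it:

mapOperator = NetExtract[audioNet, 2];
innerNet = NetExtract[mapOperator, "Net"]  (* the net that is mapped over the chunks *)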
The network used in AudioIdentify can be used not only for recognizing sounds but also to extract features from a recording. This allows any signal to be embedded in a semantically meaningful space, where similarities and distances can be computed.
The last few layers that are in charge of the classification task can be removed, and the resulting network can be reinserted into the original NetChain. This net will produce a fixed-size, semantically meaningful vector for each audio input. It can be used as a feature extractor for all the high-level machine learning functions in the system or as a starting point to train a new neural net.
Create a feature extractor net and use it in FeatureSpacePlot:
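A possible sketch, continuing the assumptions above (how many layers to drop from the per-chunk net, and its position in the chain, depend on the actual model structure; sounds is a placeholder list of Audio objects):

(* drop the classification head of the per-chunk net *)
chunkEmbedder = NetDrop[innerNet, -2];

(* put it back in place of the original mapping operator and average the per-chunk vectors *)
audioFeatureExtractor = NetAppend[
   NetReplacePart[NetTake[audioNet, 2], 2 -> NetMapOperator[chunkEmbedder]],
   AggregationLayer[Mean, 1]];

FeatureSpacePlot[sounds, FeatureExtractor -> audioFeatureExtractor]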
Another alternative is the "VGGish Feature Extractor Trained on YouTube Data" model, a structurally similar net that was trained by Google on data from YouTube specifically for audio feature extraction.
Use the VGGish network as a feature extractor in FeatureSpacePlot:
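A minimal sketch, with sounds the same placeholder list of Audio objects as above; the assumption here is that the model returns one embedding vector per audio chunk, so the vectors are averaged into a single feature vector per recording:

vggish = NetModel["VGGish Feature Extractor Trained on YouTube Data"];
FeatureSpacePlot[sounds, FeatureExtractor -> (Mean[vggish[#]] &)]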
Transfer Learning for Audio
Sometimes the amount of data available to train a network is insufficient for the task at hand. Transfer learning is a possible solution to this problem: instead of training a network from scratch, it is possible to use as a starting point a net that has already been trained on a different but related task.
Begin by downloading the ESC-50 dataset, a collection of 2,000 five-second environmental recordings in 50 classes, pre-arranged in five folds for cross-validation.
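A possible download sketch (the archive URL and the file layout are assumptions; see the ESC-50 project page for the current location):

archive = URLDownload["https://github.com/karolpiczak/ESC-50/archive/master.zip",
   FileNameJoin[{$TemporaryDirectory, "esc50.zip"}]];
ExtractArchive[archive, $TemporaryDirectory];

(* the audio files; the class and fold of each file are listed in the metadata CSV of the archive *)
audioFiles = FileNames["*.wav", $TemporaryDirectory, Infinity];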
Use the network from AudioIdentify as a starting point.
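A possible sketch, reusing the feature extractor built from the AudioIdentify net in the previous section, but without the final chunk averaging, so that each recording maps to a sequence of per-chunk embeddings:

transferExtractor = NetDrop[audioFeatureExtractor, -1];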
Instead of retraining the full net and specifying a LearningRateMultipliers option in NetTrain to train only the classification layers, precompute the results of the feature extractor net and train the classifier. This avoids redundant evaluation of the full net during training.
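A minimal sketch; esc50Audio and esc50Labels are placeholder names for the audio and class labels of the dataset:

(* compute the embedding sequences once *)
esc50Features = transferExtractor /@ esc50Audio;
featureData = Thread[esc50Features -> esc50Labels];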
Construct a simple recurrent classifier network that will be attached to the feature extractor and train it on each of the folds.
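A minimal sketch (layer sizes are illustrative; folds is a placeholder list of {trainingRules, testRules} pairs built from the precomputed features, and esc50Classes is the list of class labels):

esc50Classifier = NetChain[{
    GatedRecurrentLayer[64], SequenceLastLayer[],
    LinearLayer[50], SoftmaxLayer[]},
   "Output" -> NetDecoder[{"Class", esc50Classes}]];

(* train one classifier per fold, holding that fold out for validation *)
trainedFolds = Map[NetTrain[esc50Classifier, First[#], ValidationSet -> Last[#]] &, folds];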
After training, it is easy to measure the accuracy of the classifiers on each of the folds and average them to obtain a cross-validated result.
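A minimal sketch using NetMeasurements, under the same assumptions about folds:

accuracies = MapThread[NetMeasurements[#1, Last[#2], "Accuracy"] &, {trainedFolds, folds}];
Mean[accuracies]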
As a final test, construct a net that joins the original feature extractor and an ensemble average of the trained classifiers and run it on a signal that was not in the original dataset.
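A possible sketch of such a net; the class decoders are stripped so that the fold classifiers output probability vectors, which are then summed and rescaled into an average (the file path at the end is a placeholder):

probNets = NetReplacePart[#, "Output" -> None] & /@ trainedFolds;
n = Length[probNets];

ensemble = NetGraph[
   Join[probNets, {TotalLayer[], ElementwiseLayer[#/n &]}],
   Join[
    Thread[NetPort["Input"] -> Range[n]],  (* feed the same features to every classifier *)
    {Range[n] -> n + 1, n + 1 -> n + 2}],
   "Output" -> NetDecoder[{"Class", esc50Classes}]];

(* join the fixed feature extractor and the ensemble, then run it on a new recording *)
fullNet = NetChain[{transferExtractor, ensemble}];
fullNet[Audio["path/to/new/recording.wav"]]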