Sequence Learning and NLP with Neural Networks
Sequence Regression | Sequence Classification | Sequence-to-Sequence Learning | Question Answering | Language Modeling
Sequence learning refers to a variety of related tasks that neural nets can be trained to perform. What all these tasks have in common is that the input to the net is a sequence of some kind. This input is usually variable length, meaning that the net can operate equally well on short or long sequences. Typical forms of sequence input include:
- A variable-length array, encoded from a string using a "Characters" or "Tokens" NetEncoder.
- A variable-length array, encoded from an Audio object using an "Audio", "AudioSpectrogram", "AudioSTFT", etc. NetEncoder.
- Fixed-length forms of the above, e.g. by using the "TargetLength" option to the NetEncoder.
What distinguishes the various sequence learning tasks is the form of the output of the net. Here there is a wide diversity of techniques, each with a corresponding form of output:
- For autoregressive language models, used to model the probability of a particular sequence x, the output is a prediction of the next element of the sequence. In the case of a textual model, this is a character or token, as decoded via a "Class", "Characters", or "Tokens" NetDecoder.
- For sequence tagging models, the output is a sequence of classes of the same length as the input. For example, in the case of a part-of-speech tagger, these classes are "Noun", "Verb", etc. For this, a "Class" NetDecoder is appropriate.
- For translation models, e.g. an English to French translator, the output is itself a language model, albeit one that is conditional on the source sequence. In other words, the net has two inputs, the complete source sequence and the target sequence so far, and its output is a prediction of the next element of the target sequence.
- For CTC (connectionist temporal classification) models, the input sequence is used to form a sequence of intermittent predictions for a target sequence that is always shorter than the input sequence. Examples include handwriting recognition from pixel or stroke data, in which the input is segmented into individual characters, and audio transcription, in which features of the audio are segmented into characters or phonemes. For these, a "CTCBeamSearch" NetDecoder must be used.
Integer Addition
In this example, a net is trained to add two two-digit integers together. What makes the problem hard is that the inputs are strings rather than numeric values, whereas the output is a numeric value. For example, the net takes as input "25+33" and needs to return a real number close to 58. Note that the input is variable length, as the training data contains examples of length 3 ("5+5"), length 4 ("5+10") and length 5 ("10+10").
Create training data based on strings that describe two-digit additions and the corresponding numeric result:
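A minimal sketch of how such data could be generated (the sample count is arbitrary):
trainingData = Table[
   With[{a = RandomInteger[{0, 99}], b = RandomInteger[{0, 99}]},
    ToString[a] <> "+" <> ToString[b] -> N[a + b]],   (* rule from input string to numeric answer *)
   {10000}];
RandomSample[trainingData, 3]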
Create a net composed of a chain of recurrent layers to read an input string and predict the numeric result:
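One possible net along these lines; the layer sizes and the use of a "Scalar" output are assumptions rather than the tutorial's exact choices:
net = NetChain[{
    GatedRecurrentLayer[64],    (* first recurrent layer reads the character sequence *)
    GatedRecurrentLayer[64],    (* second recurrent layer *)
    SequenceLastLayer[],        (* keep only the final element of the sequence *)
    LinearLayer[]},             (* map to the predicted sum *)
   "Input" -> NetEncoder[{"Characters", Join[CharacterRange["0", "9"], {"+"}], "UnitVector"}],
   "Output" -> "Scalar"];
trained = NetTrain[net, trainingData];
trained["25+33"]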
Sentiment Analysis
Sequence-to-Sequence Learning
Sequence-to-sequence learning is a task in which both the input and the predicted output are sequences. Translating German to English, transcribing speech from an audio file, and sorting lists of numbers are all examples of this task.
Integer Addition with Fixed-Length Output
This example demonstrates how to train nets that take a variable-length sequence as input and predict a fixed-length sequence as output. We take a string that describes a sum, e.g. "250+123", and produce an output string that describes the answer, e.g. "373".
Create training data based on strings that describe three-digit additions and the corresponding result as a string. In order for the output to be fixed length, all outputs are padded to the maximum length of 4 (as the maximum value is 999+999=1998):
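A sketch of such data; here the answers are left-padded with zeros to 4 characters (the tutorial may pad differently, e.g. with spaces):
trainingData = Table[
   With[{a = RandomInteger[{0, 999}], b = RandomInteger[{0, 999}]},
    ToString[a] <> "+" <> ToString[b] -> IntegerString[a + b, 10, 4]],  (* e.g. "250+123" -> "0373" *)
   {10000}];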
Integer Addition with Variable-Length Output
This example demonstrates how to train nets on sequences where both the input and output are variable-length sequences whose lengths can differ. One sophisticated example of this task is translating English to German, but the example we cover is a simpler problem: taking a string that describes a sum, e.g. "250+123", and producing an output string that describes the answer, e.g. "373". The method used is based on I. Sutskever et al., "Sequence to Sequence Learning with Neural Networks", 2014.
Create training data based on strings that describe three-digit additions and the corresponding result as a string:
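For example (the sample count is arbitrary):
trainingData = Table[
   With[{a = RandomInteger[{0, 999}], b = RandomInteger[{0, 999}]},
    ToString[a] <> "+" <> ToString[b] -> ToString[a + b]],
   {50000}];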
Create a NetEncoder that uses a special code for the start and end of a string, which will be used to indicate the beginning and end of the output sequence:
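A sketch of such an encoder, assuming the "Characters" encoder accepts the StartOfString and EndOfString symbols as part of its alphabet (so that every encoded string is bracketed by virtual start and end characters):
alphabet = Join[CharacterRange["0", "9"], {StartOfString, EndOfString}];
targetEnc = NetEncoder[{"Characters", alphabet, "UnitVector"}]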
Define a net that takes an input vector of size 150 and a sequence of vectors as input and returns a sequence of vectors as output:
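A sketch of such a decoder: the size-150 code enters through a "Vector" port and seeds the recurrent state, while the target prefix enters through the "Input" port as a sequence of one-hot vectors:
decoderNet = NetGraph[{
    GatedRecurrentLayer[150],                        (* 1: recurrent layer over the target prefix *)
    NetMapOperator[LinearLayer[Length[alphabet]]],   (* 2: per-element scores over the alphabet *)
    SoftmaxLayer[]},                                 (* 3: per-element probabilities *)
   {NetPort["Vector"] -> NetPort[1, "State"],        (* seed the initial state with the code vector *)
    NetPort["Input"] -> 1 -> 2 -> 3},
   "Vector" -> 150]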
Define a net with a CrossEntropyLossLayer and containing the encoder and decoder nets:
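A sketch of the training graph. The encoder is assumed to be a chain that reads the sum string and emits the size-150 code; the target is shifted so that the decoder learns to predict each next character:
encoderNet = NetChain[{
    GatedRecurrentLayer[150],
    SequenceLastLayer[]},
   "Input" -> NetEncoder[{"Characters", Join[CharacterRange["0", "9"], {"+"}], "UnitVector"}]];
trainNet = NetGraph[<|
    "encoder" -> encoderNet,
    "decoder" -> decoderNet,
    "most" -> SequenceMostLayer[],           (* target without its last character: decoder input *)
    "rest" -> SequenceRestLayer[],           (* target without its first character: prediction target *)
    "loss" -> CrossEntropyLossLayer["Probabilities"]|>,
   {NetPort["Input"] -> "encoder" -> NetPort["decoder", "Vector"],
    NetPort["Target"] -> "most" -> NetPort["decoder", "Input"],
    "decoder" -> NetPort["loss", "Input"],
    NetPort["Target"] -> "rest" -> NetPort["loss", "Target"]},
   "Target" -> targetEnc];
trained = NetTrain[trainNet, trainingData]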
In the case of three-digit integer addition, there are only 1999 possible outputs (0 through 1998). It is feasible to calculate the loss for each possible output and find the one that minimizes the loss:
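A sketch of this brute-force approach, assuming trained is the loss net returned by NetTrain above:
predictByEnumeration[input_String] :=
  First@MinimalBy[ToString /@ Range[0, 1998],
    trained[<|"Input" -> input, "Target" -> #|>] &];
predictByEnumeration["431+321"]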
A more efficient way of obtaining predictions is to generate the output until the EndOfString virtual character is reached.
First, extract the trained "encoder" and "decoder" subnets from the trained NetGraph, and attach appropriate encoders and decoders:
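For example, following the node and port names used in the sketch above:
trainedEncoder = NetReplacePart[NetExtract[trained, "encoder"],
   "Input" -> NetEncoder[{"Characters", Join[CharacterRange["0", "9"], {"+"}], "UnitVector"}]];
trainedDecoder = NetExtract[trained, "decoder"];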
Use a SequenceLastLayer to make a version of the decoder that only produces predictions for the last character of the answer, given the previous characters. Here a character decoder is attached for the probability vector using the same alphabet as the "Target" input:
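A sketch, using the same alphabet as the "Target" encoder:
lastCharNet = NetGraph[<|
    "decoder" -> trainedDecoder,
    "last" -> SequenceLastLayer[]|>,            (* keep only the prediction for the final position *)
   {"decoder" -> "last"},
   "Output" -> NetDecoder[{"Class", alphabet}]]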
Now define a prediction function that takes the "encoder" and "decoder" nets and an input string. The function will compute successively longer results until the decoder claims to be finished:
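A sketch of such a function. The target prefix is fed to the decoder as explicit one-hot vectors over the alphabet, and the length cap is only a safety net:
oneHot[c_] := UnitVector[Length[alphabet], First@FirstPosition[alphabet, c]];
predictSum[input_String] :=
  Module[{code = trainedEncoder[input], prefix = {StartOfString}, next},
   next = lastCharNet[<|"Vector" -> code, "Input" -> oneHot /@ prefix|>];
   While[next =!= EndOfString && Length[prefix] < 6,
    AppendTo[prefix, next];
    next = lastCharNet[<|"Vector" -> code, "Input" -> oneHot /@ prefix|>]];
   StringJoin[Rest[prefix]]];
predictSum["431+321"]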
The naive technique of generating by passing each partial answer to the decoder net to derive the next character has time complexity O(n^2), where n is the length of the output sequence, and so is not appropriate for generating longer sequences. NetStateObject can be used to generate with time complexity O(n).
First, a decoder is created that takes a single character and predicts the next character. The recurrent state of the GatedRecurrentLayer is handled by a NetStateObject at a later point.
Define a "Class" encoder and decoder that will encode and decode individual characters, as well as the special classes to indicate the start and end of the string:
Define a net that takes a single character, runs one step of the GatedRecurrentLayer, and produces a single softmax prediction:
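A structural sketch of such a net. In the actual workflow, the GatedRecurrentLayer and LinearLayer would be the trained ones, extracted from the trained decoder with NetExtract; fresh layers stand in for them here:
singleStep = NetChain[{
    ReshapeLayer[{1, Length[alphabet]}],    (* one one-hot character -> a length-1 sequence *)
    GatedRecurrentLayer[150],               (* one recurrent step *)
    SequenceLastLayer[],                    (* drop the sequence dimension again *)
    LinearLayer[Length[alphabet]],
    SoftmaxLayer[]},
   "Input" -> charEnc, "Output" -> charDec]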
This predictor has an internal recurrent state, as revealed by Information:
Create a function that uses NetStateObject to memorize the internal recurrent state, which is seeded from the code produced by the trained encoder:
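A sketch of such a function. The exact specification for supplying an initial recurrent-state value to NetStateObject is an assumption here; see the NetStateObject documentation for the precise form:
predictSumFast[input_String] :=
  Module[{state, next, out = {}},
   state = NetStateObject[singleStep, {trainedEncoder[input]}];  (* assumed: seed the state with the code *)
   next = state[StartOfString];
   While[next =!= EndOfString && Length[out] < 6,
    AppendTo[out, next];
    next = state[next]];
   StringJoin[out]]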
Integer Sorting
In this example, a net is trained to sort lists of integers. For example, the net takes as input {3,2,4} and needs to return {2,3,4}. This example also demonstrates the use of an AttentionLayer, which significantly improves the performance of neural nets on many sequence learning tasks.
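Training data for this task can be generated directly (the value range and list lengths are arbitrary here):
trainingData = Table[
   With[{list = RandomInteger[{1, 6}, RandomInteger[{3, 8}]]},
    list -> Sort[list]],
   {10000}];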
Optical Character Recognition (OCR) on a Toy Dataset
The optical character recognition problem is to take an image containing a sequence of characters and return the list of characters. One simple approach is to preprocess the image into sub-images that each contain a single character and classify those individually. This is a fragile approach that fails completely for domains such as cursive handwriting, where the characters run together.
First, generate training and test data, which consists of images of words and the corresponding word string:
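A rough sketch of such toy data, rendering random lowercase words with Rasterize; the font, image size, and vocabulary will differ from the tutorial's. Targets are given as lists of characters, the form expected by the CTC loss:
randomWord[] := StringJoin@RandomChoice[CharacterRange["a", "z"], 5];
makeExample[] := With[{word = randomWord[]},
   Rasterize[Style[word, FontFamily -> "Courier", 24], "Image"] -> Characters[word]];
trainData = Table[makeExample[], {2000}];
testData = Table[makeExample[], {200}];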
Take a RandomSample of the training set:
Define a net that takes an image and then treats the width dimension as a sequence dimension. A sequence of probability vectors over the width dimension is produced:
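A structural sketch of such a net. The per-position class count is the alphabet size plus one (the extra class is the CTC blank); filter counts and the input image size are assumptions. With a "CTCBeamSearch" decoder attached, NetTrain can attach a CTC loss automatically:
letters = CharacterRange["a", "z"];
ocrNet = NetChain[{
    ConvolutionLayer[32, {3, 3}, "PaddingSize" -> 1], Ramp,
    ConvolutionLayer[32, {3, 3}, "PaddingSize" -> 1], Ramp,
    TransposeLayer[1 <-> 3],                        (* channels x height x width -> width x height x channels *)
    NetMapOperator[FlattenLayer[]],                 (* one feature vector per width position *)
    NetMapOperator[LinearLayer[Length[letters] + 1]],
    SoftmaxLayer[]},                                (* a probability vector at each width position *)
   "Input" -> NetEncoder[{"Image", {120, 30}, "ColorSpace" -> "Grayscale"}],
   "Output" -> NetDecoder[{"CTCBeamSearch", letters}]];
trainedOCR = NetTrain[ocrNet, trainData, ValidationSet -> testData]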
Simple RNN Trained on the bAbI QA Dataset
Train a question-answering net on the first task (Single Supporting Fact) of the bAbI QA dataset using a recurrent network.
Train the network for three training rounds. NetTrain will automatically attach a CrossEntropyLossLayer using the same classes that were provided to the decoder:
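For example, with hypothetical trainData and testData splits and net being the question-answering net for this task:
trained = NetTrain[net, trainData, ValidationSet -> testData, MaxTrainingRounds -> 3]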
Memory Network Trained on the bAbI QA Dataset
Train a question-answering net on the first task (Single Supporting Fact) of the bAbI QA dataset, using a memory network based on Sukhbaatar et al., "End-to-End Memory Networks", 2015.
The memory net has layers (such as TransposeLayer) that currently do not support variable-length sequences.
Convert all strings to lists of tokens and use left padding (which has better performance than right padding for this example) to ensure these lists are the same length:
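A sketch of this preprocessing (the exact tokenization rule is an assumption):
tokenize[s_String] := ToLowerCase /@ StringSplit[StringDelete[s, "." | "?"]];
padTo[tokens_, n_] := PadLeft[tokens, n, ""];          (* left padding with an empty token *)
padTo[tokenize["Mary moved to the bathroom."], 8]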
Train the net using the "RMSProp" optimization method, which improves learning performance for this example:
Character-Level Language Model
The data takes the form of a classification problem: given a sequence of characters, predict the next one. A sample of the data:
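A sketch of how such data can be built from a long text; the corpus and the 25-character window are arbitrary choices here:
text = ExampleData[{"Text", "AliceInWonderland"}];
pairs = Table[
   StringTake[text, {i, i + 24}] -> StringTake[text, {i + 25}],   (* 25 characters -> the next character *)
   {i, 1, StringLength[text] - 25, 25}];
RandomSample[pairs, 2]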
Train the net. This can take up to an hour on a CPU, so use TargetDevice->"GPU" if you have an NVIDIA graphics card available. A modern GPU should be able to complete this example in about 7 minutes:
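A sketch of a small character-level net together with the training call (the tutorial's architecture and sizes may differ):
chars = Union[Characters[text]];
languageNet = NetChain[{
    GatedRecurrentLayer[128],
    SequenceLastLayer[],
    LinearLayer[Length[chars]],
    SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Characters", chars, "UnitVector"}],
   "Output" -> NetDecoder[{"Class", chars}]];
trained = NetTrain[languageNet, pairs, TargetDevice -> "GPU"]   (* drop TargetDevice to train on the CPU *)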
Generate 100 characters of text, given a start text. Note that this method uses NetStateObject to efficiently generate long sequences of text:
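A sketch of such a generation function. It samples each next character from the predicted distribution, and because NetStateObject preserves the recurrent state between calls, only one new character needs to be fed back per step:
generateText[net_, start_String, n_Integer] :=
  Module[{state = NetStateObject[net], last, out = start},
   last = state[start, "RandomSample"];       (* read the seed text, sample the next character *)
   Do[
    out = out <> last;
    last = state[last, "RandomSample"],       (* feed back one character at a time *)
    {n - 1}];
   out <> last];
generateText[trained, "The ", 100]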
An alternative and equivalent formulation of this learning problem, requiring only a single string as input, is to separate the last character from the rest of the sequence inside the net.
Use SequenceMostLayer and SequenceLastLayer in a graph to separate the last character from the rest:
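A sketch of such a graph, reusing the chars alphabet from above; SequenceMostLayer supplies the context and SequenceLastLayer supplies the character to be predicted:
selfTargetNet = NetGraph[<|
    "most" -> SequenceMostLayer[],            (* all but the last character *)
    "last" -> SequenceLastLayer[],            (* the last character, used as the target *)
    "predict" -> NetChain[{GatedRecurrentLayer[128], SequenceLastLayer[],
       LinearLayer[Length[chars]], SoftmaxLayer[]}],
    "loss" -> CrossEntropyLossLayer["Probabilities"]|>,
   {NetPort["Input"] -> "most" -> "predict" -> NetPort["loss", "Input"],
    NetPort["Input"] -> "last" -> NetPort["loss", "Target"]},
   "Input" -> NetEncoder[{"Characters", chars, "UnitVector"}]]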
Train this net on the input sequences from the original training data (technically, this means you end up training the net on slightly shorter sequences):
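For example, using only the input strings of the pairs built earlier:
trainedSelf = NetTrain[selfTargetNet, <|"Input" -> Keys[pairs]|>]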
A more efficient way to train language models is to use "teacher forcing", in which the net predicts every element of the sequence simultaneously, rather than just the last character.
First, build the net that does prediction of an entire sequence at once. This differs from the previous prediction nets in that the LinearLayer is mapped and a matrix softmax is performed, instead of taking the last element and doing an ordinary vector softmax:
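A sketch of the all-positions predictor (layer sizes as before):
predictAll = NetChain[{
    GatedRecurrentLayer[128],
    NetMapOperator[LinearLayer[Length[chars]]],   (* the LinearLayer is mapped over every position *)
    SoftmaxLayer[]}]                              (* softmax applied at each position *)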
Now build the forcing network, which takes a target sentence and presents it to the network in a "staggered" fashion: for a length-26 sentence, present characters 1 through 25 to the net so that it produces predictions for characters 2 through 26, which are compared with the real characters via the CrossEntropyLossLayer to produce a loss:
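A sketch of the forcing graph: SequenceMostLayer supplies characters 1 through n-1 to the predictor, and SequenceRestLayer supplies characters 2 through n as the targets:
forcingNet = NetGraph[<|
    "most" -> SequenceMostLayer[],
    "rest" -> SequenceRestLayer[],
    "predict" -> predictAll,
    "loss" -> CrossEntropyLossLayer["Probabilities"]|>,
   {NetPort["Input"] -> "most" -> "predict" -> NetPort["loss", "Input"],
    NetPort["Input"] -> "rest" -> NetPort["loss", "Target"]},
   "Input" -> NetEncoder[{"Characters", chars, "UnitVector"}]]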
Train the net on the input sequences from the original data. On a typical CPU, this should take around 15 minutes, compared to around 2 minutes on a modern GPU. As teacher forcing is a more efficient technique, you can afford to use a smaller number of training rounds:
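For example (the round count is arbitrary):
trainedForcing = NetTrain[forcingNet, <|"Input" -> Keys[pairs]|>, MaxTrainingRounds -> 5]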
Generate 100 characters of text, given a start text. Note that this method uses NetStateObject to efficiently generate long sequences of text:
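To do this with the teacher-forced net, one can rebuild a next-character predictor from its trained "predict" subnet and reuse the generateText function defined above:
predictor = NetChain[{
    NetExtract[trainedForcing, "predict"],    (* the trained recurrent layer, mapped linear layer, and softmax *)
    SequenceLastLayer[]},                     (* keep only the prediction for the final position *)
   "Input" -> NetEncoder[{"Characters", chars, "UnitVector"}],
   "Output" -> NetDecoder[{"Class", chars}]];
generateText[predictor, "The ", 100]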