Sequence Learning and NLP with Neural Networks

Sequence learning refers to a variety of related tasks that neural nets can be trained to perform. What all these tasks have in common is that the input to the net is a sequence of some kind. This input is usually variable length, meaning that the net can operate equally well on short or long sequences.
What distinguishes the various sequence learning tasks is the form of the output of the net, and there is a wide diversity of techniques with corresponding forms of output: sequence regression (the output is a numeric value), sequence classification (the output is a class), sequence-to-sequence learning (the output is itself a sequence), question answering and language modeling. We give simple examples of most of these techniques in this tutorial.
Sequence Regression

Integer Addition

In this example, a net is trained to add two integers of up to two digits each. What makes the problem hard is that the input is a string rather than a pair of numeric values, while the output is a numeric value. For example, the net takes as input "25+33" and needs to return a real number close to 58. Note that the input is variable length: the training data contains examples of length 3 ("5+5"), length 4 ("5+10") and length 5 ("10+10").
Create training data based on strings that describe two-digit additions and the corresponding numeric result:
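A minimal sketch of such data generation (the count and digit ranges are illustrative):

    trainData = Table[
       With[{a = RandomInteger[{0, 99}], b = RandomInteger[{0, 99}]},
        ToString[a] <> "+" <> ToString[b] -> N[a + b]],
       {10000}];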
Create a net composed of a chain of recurrent layers to read an input string and predict the numeric result:
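One possible chain is sketched below; the layer sizes are illustrative. The "Characters" encoder turns the input string into a sequence of one-hot vectors over the digits and "+":

    net = NetChain[{
       GatedRecurrentLayer[100],
       GatedRecurrentLayer[100],
       SequenceLastLayer[],  (* keep only the final element of the output sequence *)
       LinearLayer[{}]},     (* map the final state to a single scalar *)
      "Input" -> NetEncoder[{"Characters", Join[CharacterRange["0", "9"], {"+"}]}]]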
Train the network:
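For instance, assuming the pairs above are stored in trainData:

    trained = NetTrain[net, trainData]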
Apply the trained network to a list of inputs:
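Nets thread over batches, so a list of strings can be passed directly:

    trained[{"25+33", "5+5", "10+10"}]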
Sequence Classification

Sentiment Analysis

Train a classifier that classifies movie review snippets as "positive" or "negative".
First, obtain the training and test data:
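One convenient source is the Wolfram Data Repository; the resource name below is an assumption, and any list of snippet -> class rules split into training and test sets will work:

    data = ResourceData["Sample Data: Movie Review Sentence Polarity"];  (* assumed resource name *)
    {trainData, testData} = TakeDrop[RandomSample[Normal[data]], 9000];  (* assumed split *)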
Define a net that takes a string of words as input and returns either "positive" or "negative":
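A sketch of one such chain; the default "Tokens" encoder segments the string into words from a built-in English vocabulary, and the layer sizes are illustrative:

    net = NetChain[{
       EmbeddingLayer[60],              (* map each token to a 60-dimensional vector *)
       LongShortTermMemoryLayer[80],
       SequenceLastLayer[],
       LinearLayer[2],
       SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Tokens"}],
      "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]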
Train the net for two training rounds:
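For example:

    trained = NetTrain[net, trainData, MaxTrainingRounds -> 2]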
Evaluate the trained net on an example from the test set, obtaining the probabilities:
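The "Probabilities" property of the "Class" decoder returns the full distribution rather than the most likely class (the snippet here is illustrative):

    trained["a gorgeous, witty, seductive movie", "Probabilities"]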
Sequence-to-Sequence Learning
Sequence-to-sequence learning is a task in which both the input and the predicted output are sequences. Examples include translating German to English, transcribing speech from an audio file and sorting lists of numbers.

Integer Addition with Fixed-Length Output

This example demonstrates how to train nets that take a variable-length sequence as input and predict a fixed-length sequence as output. We take a string that describes a sum, e.g. "250+123", and produce an output string that describes the answer, e.g. "373".
Create training data based on strings that describe three-digit additions and the corresponding result as a string. In order for the output to be fixed length, all outputs are padded to length 4, as the maximum possible sum, 999+999=1998, has 4 digits:
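A sketch of the data generation, using StringPadRight to pad each answer with trailing spaces:

    genData[n_] := Table[
       With[{a = RandomInteger[{0, 999}], b = RandomInteger[{0, 999}]},
        ToString[a] <> "+" <> ToString[b] -> StringPadRight[ToString[a + b], 4]],
       {n}];
    trainData = genData[10000];
    testData = genData[1000];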
Define a net, taking a string as input and returning another string of length 4:
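A possible architecture, sketched with illustrative sizes: the input is summarized as a single vector, replicated into a length-4 sequence and decoded back into characters (the output alphabet is the ten digits plus the padding space):

    outAlphabet = Join[CharacterRange["0", "9"], {" "}];
    net = NetChain[{
       GatedRecurrentLayer[150],
       SequenceLastLayer[],      (* summary vector for the whole input *)
       ReplicateLayer[4],        (* make a length-4 sequence from it *)
       GatedRecurrentLayer[150],
       NetMapOperator[LinearLayer[Length[outAlphabet]]],
       SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Characters", Join[CharacterRange["0", "9"], {"+"}]}],
      "Output" -> NetDecoder[{"Characters", outAlphabet}]]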
Train the net for 40 training rounds:
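For example:

    trained = NetTrain[net, trainData, MaxTrainingRounds -> 40]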
Apply the net to some examples:
Obtain the accuracy of the net on the test set:
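One direct way, comparing the predicted strings with the targets:

    N@Mean@Boole@MapThread[SameQ, {trained[Keys[testData]], Values[testData]}]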

Integer Addition with Variable-Length Output

This example demonstrates how to train nets on sequences where both the input and the output are variable-length sequences whose lengths can differ. A sophisticated example of this task is translating English to German, but the example we cover is a simpler problem: taking a string that describes a sum, e.g. "250+123", and producing an output string that describes the answer, e.g. "373". The method used is based on I. Sutskever et al., "Sequence to Sequence Learning with Neural Networks", 2014.
Create training data based on strings that describe three-digit additions and the corresponding result as a string:
Create a NetEncoder that uses a special code for the start and end of a string, which will be used to indicate the beginning and end of the output sequence:
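A sketch; the "Characters" encoder can include the symbols StartOfString and EndOfString in its alphabet as virtual characters (the exact specification is an assumption):

    targetEnc = NetEncoder[{"Characters",
        Join[CharacterRange["0", "9"], {StartOfString, EndOfString}]}]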
Evaluate the encoder on an input string:
Create a similar encoder for the input, which contains a "+":
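Under the same assumption, the input alphabet simply adds "+" and needs no virtual characters:

    inputEnc = NetEncoder[{"Characters", Join[CharacterRange["0", "9"], {"+"}]}]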
Define a net that takes a sequence of inputs and returns a single vector of size 150 as output:
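For instance, with 150 matching the code size stated above:

    encoderNet = NetChain[{
       GatedRecurrentLayer[150],
       SequenceLastLayer[]}]   (* the final state summarizes the whole input *)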
Define a net that takes an input vector of size 150 and a sequence of vectors as input and returns a sequence of vectors as output:
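A sketch: the vector enters the "State" port of a GatedRecurrentLayer as its initial state, and a mapped LinearLayer plus softmax produce one probability vector per element (the port wiring is an assumption; 12 = ten digits plus the two virtual characters):

    decoderNet = NetGraph[<|
       "gru" -> GatedRecurrentLayer[150],
       "map" -> NetMapOperator[LinearLayer[12]],
       "softmax" -> SoftmaxLayer[]|>,
      {NetPort["State"] -> NetPort["gru", "State"],
       "gru" -> "map" -> "softmax"}]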
Define a net with a CrossEntropyLossLayer and containing the encoder and decoder nets:
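A sketch of the training graph, assuming the nets and encoders above: the decoder is fed all but the last target character (SequenceMostLayer), and its predictions are compared against all but the first (SequenceRestLayer):

    trainingNet = NetGraph[<|
       "encoder" -> encoderNet,
       "decoder" -> decoderNet,
       "most" -> SequenceMostLayer[],
       "rest" -> SequenceRestLayer[],
       "loss" -> CrossEntropyLossLayer["Probabilities"]|>,
      {NetPort["Input"] -> "encoder" -> NetPort["decoder", "State"],
       NetPort["Target"] -> "most" -> "decoder",
       NetPort["Target"] -> "rest" -> NetPort["loss", "Target"],
       "decoder" -> NetPort["loss", "Input"]},
      "Input" -> inputEnc,
      "Target" -> targetEnc]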
Train the net (this procedure is often referred to as "teacher forcing"):
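For example, with the inputs and targets as separate lists of strings:

    trained = NetTrain[trainingNet, <|"Input" -> inputs, "Target" -> targets|>]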
For three-digit integer addition, there are only 1999 possible outputs ("0" through "1998"). It is therefore feasible to calculate the loss for every possible output and choose the output that minimizes it:
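A sketch of this brute-force search; the most likely output is the one with the lowest loss:

    predictExact[input_String] := First@MinimalBy[
       ToString /@ Range[0, 1998],
       trained[<|"Input" -> input, "Target" -> #|>] &]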
Predict the output, given a number of inputs and the trained net:
The loss for each output, given an input, can also be visualized:
A more efficient way of obtaining predictions is to generate the output until the EndOfString virtual character is reached.
First, extract the trained "encoder" and "decoder" subnets from the trained NetGraph, and attach appropriate encoders and decoders:
Use a SequenceLastLayer to make a version of the decoder that only produces a prediction for the last character of the answer, given the previous characters. Here, a character decoder using the same alphabet as the "Target" input is attached to the probability vector:
Apply the decoder to a partial answer:
When the decoder predicts the end of the target string, an empty string will be produced:
Now define a prediction function that takes the "encoder" and "decoder" nets and an input string. The function will compute successively longer results until the decoder claims to be finished:
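A sketch of such a function, assuming trainedEncoder maps an input string to the code vector and trainedDecoder maps a state plus a partial answer to the next character (or "" when finished), as constructed above; the port names are assumptions, and a loop cap guards against non-termination:

    predict[input_String] := Module[
      {state = trainedEncoder[input], answer = "", next},
      Do[
       next = trainedDecoder[<|"State" -> state, "Target" -> answer|>];
       If[next === "", Break[]];
       answer = answer <> next,
       {6}];  (* answers have at most 4 characters *)
      answer]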
Evaluate this prediction function on input data:
This is a good approximation to the first method, which finds the exact maximum-likelihood answer:
The naive technique of generating by passing each partial answer to the decoder net to derive the next character has time complexity O(n^2), where n is the length of the output sequence, and so is not appropriate for generating longer sequences. NetStateObject can be used to generate with time complexity O(n).
First, a decoder is created that takes a single character and predicts the next character. The recurrent state of the GatedRecurrentLayer will be handled by a NetStateObject later.
Obtain the trained GatedRecurrentLayer and LinearLayer from the trained net:
Define a "Class" encoder and decoder that will encode and decode individual characters, as well as the special classes to indicate the start and end of the string:
Define a net that takes a single character, runs one step of the GatedRecurrentLayer, and produces a single softmax prediction:
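A sketch, assuming gru and linear are the trained layers extracted above and classes is the list of 12 classes (the digits plus StartOfString and EndOfString); the one-hot vector is reshaped into a length-1 sequence so the recurrent layer can take a single step:

    charPredictor = NetChain[{
       ReshapeLayer[{1, 12}],   (* one character becomes a length-1 sequence *)
       gru,
       SequenceLastLayer[],
       linear,
       SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Class", classes, "UnitVector"}],
      "Output" -> NetDecoder[{"Class", classes}]]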
This predictor has an internal recurrent state, as revealed by Information:
Create a function that uses NetStateObject to memorize the internal recurrent state, which is seeded from the code produced by the trained encoder:
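A sketch; the form used here for seeding NetStateObject with the encoder's code vector is an assumption:

    generate[input_String] := Module[
      {obj = NetStateObject[charPredictor, {trainedEncoder[input]}],  (* assumed seeding form *)
       next, out = ""},
      next = obj[StartOfString];
      While[next =!= EndOfString && StringLength[out] < 6,
       out = out <> next;
       next = obj[next]];
      out]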
Apply the function to some inputs to obtain predicted sequences:
Obtain the accuracy of the trained net on the test set:

Integer Sorting

In this example, a net is trained to sort lists of integers. For example, the net takes as input {3,2,4} and needs to return {2,3,4}. This example also demonstrates the use of an AttentionLayer, which significantly improves the performance of neural nets on many sequence learning tasks.
Generate a test and training set consisting of lists of integers between 1 and 6:
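A sketch of the data generation, with list lengths between 3 and 8 (the length range and counts are illustrative):

    genSortData[n_] := Table[
       With[{list = RandomInteger[{1, 6}, RandomInteger[{3, 8}]]},
        list -> Sort[list]],
       {n}];
    trainData = genSortData[10000];
    testData = genSortData[1000];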
Display three random samples drawn from the training set:
Define a NetGraph with an AttentionLayer:
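A loose sketch (the sizes, the dot-product scoring and the self-attention wiring are assumptions): every position queries the recurrent encoding of the whole list, and a mapped LinearLayer plus softmax predict the integer belonging at that position:

    net = NetGraph[<|
       "encode" -> NetChain[{EmbeddingLayer[32, 6], GatedRecurrentLayer[64]}],
       "attend" -> AttentionLayer["Dot"],
       "classify" -> NetMapOperator[LinearLayer[6]],
       "softmax" -> SoftmaxLayer[]|>,
      {"encode" -> NetPort["attend", "Input"],
       "encode" -> NetPort["attend", "Query"],
       "attend" -> "classify" -> "softmax"},
      "Output" -> NetDecoder[{"Class", Range[6]}]]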
Train the net:
Use the net to sort a list of integers:

Optical Character Recognition (OCR) on a Toy Dataset

The optical character recognition problem is to take an image containing a sequence of characters and return the list of characters. One simple approach is to preprocess the image to produce images containing only a single character and do classification. This is a fragile approach and completely fails for domains such as cursive handwriting, where the characters run together.
First, generate training and test data, which consists of images of words and the corresponding word string:
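A sketch of one way to generate such data, rasterizing short dictionary words (the font and counts are illustrative):

    words = Select[DictionaryLookup[], StringMatchQ[#, RegularExpression["[a-z]{3,6}"]] &];
    data = Map[
       Rasterize[Style[#, FontFamily -> "Courier", FontSize -> 24]] -> # &,
       RandomSample[words, 2000]];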
Split the dataset into a test set and a training set:
Take a RandomSample of the training set:
The list of characters used:
The decoder is a beam search decoder with a beam size of 50:
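Assuming characters is the list obtained above:

    ctcDecoder = NetDecoder[{"CTCBeamSearch", characters, "BeamSize" -> 50}]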
Define a net that takes an image and then treats the width dimension as a sequence dimension. A sequence of probability vectors over the width dimension is produced:
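A loose sketch assuming 128x32 grayscale inputs (all sizes illustrative): convolution and pooling extract features, the width axis is moved into the leading (sequence) position, and a recurrent layer plus mapped LinearLayer emit one probability vector per width step (the extra class is the CTC blank):

    ocrNet = NetChain[{
       ConvolutionLayer[32, {3, 3}, "PaddingSize" -> 1], Ramp,
       PoolingLayer[{2, 2}, 2],
       ConvolutionLayer[64, {3, 3}, "PaddingSize" -> 1], Ramp,
       PoolingLayer[{2, 2}, 2],
       TransposeLayer[1 <-> 3],   (* channels x height x width -> width x height x channels *)
       ReshapeLayer[{32, 512}],   (* one 512-vector per width step *)
       GatedRecurrentLayer[128],
       NetMapOperator[LinearLayer[Length[characters] + 1]],
       SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Image", {128, 32}, ColorSpace -> "Grayscale"}],
      "Output" -> ctcDecoder]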
Define a CTCLossLayer with a character NetEncoder attached to the target port:
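A sketch; attaching a "Characters" encoder with "Index" output to the target port is an assumption:

    ctcLoss = CTCLossLayer["Target" -> NetEncoder[{"Characters", characters, "Index"}]]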
Train the net using the CTC loss:
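For example:

    trainedOCR = NetTrain[ocrNet, data, LossFunction -> ctcLoss, MaxTrainingRounds -> 30]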
Evaluate the trained net on images from the test set:
Obtain the top 5 decodings for an image, along with the negative log-likelihood of each decoding:
Question Answering

Simple RNN Trained on the bAbI QA Dataset

Train a question-answering net on the first task (Single Supporting Fact) of the bAbI QA dataset using a recurrent network.
First, obtain the training and validation data:
Obtain the list of answer classes and the dictionary used for the questions and contexts:
Define a net that takes a question string and a context string and returns an answer:
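A sketch, assuming classes and dictionary were obtained in the previous step (sizes illustrative): question and context are each summarized by a recurrent chain, then catenated and classified:

    qaNet = NetGraph[<|
       "question" -> NetChain[{EmbeddingLayer[50], GatedRecurrentLayer[100], SequenceLastLayer[]}],
       "context" -> NetChain[{EmbeddingLayer[50], GatedRecurrentLayer[100], SequenceLastLayer[]}],
       "cat" -> CatenateLayer[],
       "answer" -> NetChain[{LinearLayer[Length[classes]], SoftmaxLayer[]}]|>,
      {NetPort["Question"] -> "question",
       NetPort["Context"] -> "context",
       {"question", "context"} -> "cat" -> "answer"},
      "Question" -> NetEncoder[{"Tokens", dictionary}],
      "Context" -> NetEncoder[{"Tokens", dictionary}],
      "Output" -> NetDecoder[{"Class", classes}]]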
Train the network for three training rounds. NetTrain will automatically attach a CrossEntropyLossLayer using the same classes that were provided to the decoder:
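For example:

    trained = NetTrain[qaNet, trainData, MaxTrainingRounds -> 3]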
Make a prediction with the trained net:
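For instance (the strings are illustrative of the bAbI format):

    trained[<|"Question" -> "Where is John?",
      "Context" -> "John travelled to the hallway. Mary went to the office."|>]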
Obtain the accuracy of the trained net on the test set:

Memory Network Trained on the bAbI QA Dataset

Train a question-answering net on the first task (Single Supporting Fact) of the bAbI QA dataset, using a memory network based on Sukhbaatar et al., "End-to-End Memory Networks", 2015.
First, obtain the training and validation data:
The memory net has layers (such as TransposeLayer) that currently do not support variable-length sequences.
Convert all strings to lists of tokens and use left padding (which has better performance than right padding for this example) to ensure these lists are the same length:
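A sketch of the preprocessing; the whitespace tokenization is an assumption, and contexts stands for the list of context strings:

    toTokens[s_String] := StringSplit[s];
    padTo[tokens_List, n_Integer] := PadLeft[tokens, n, Padding];
    paddedContexts = padTo[toTokens[#], 68] & /@ contexts;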
The "Context" input is now a list of length 68 padded with the Padding symbol:
Obtain the list of answer classes and the dictionary used for the questions and contexts:
Define the net:
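A loose single-hop sketch of the architecture, not the exact net: tokens are assumed pre-encoded as integer codes with vocabulary size v, the fixed lengths are 68 (context) and 8 (question), and all sizes are illustrative. The question vector u scores each memory slot, the scores are softmaxed, the output embeddings are read out under those weights, and u plus the readout is classified:

    v = 30;  (* hypothetical vocabulary size *)
    memNet = NetGraph[<|
       "A" -> EmbeddingLayer[50, v, "Input" -> {68}],  (* memory embeddings *)
       "C" -> EmbeddingLayer[50, v, "Input" -> {68}],  (* output embeddings *)
       "B" -> NetChain[{EmbeddingLayer[50, v, "Input" -> {8}],
          AggregationLayer[Total, 1]}],                (* question vector u *)
       "match" -> DotLayer[],                          (* one score per memory slot *)
       "weights" -> SoftmaxLayer[],
       "read" -> DotLayer[],                           (* weighted sum of output embeddings *)
       "add" -> TotalLayer[],
       "answer" -> NetChain[{LinearLayer[Length[classes]], SoftmaxLayer[]}]|>,
      {NetPort["Context"] -> "A", NetPort["Context"] -> "C",
       NetPort["Question"] -> "B",
       {"A", "B"} -> "match" -> "weights",
       {"weights", "C"} -> "read",
       {"read", "B"} -> "add" -> "answer"},
      "Output" -> NetDecoder[{"Class", classes}]]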
Train the net using the "RMSProp" optimization method, which improves learning performance for this example:
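For example (the round count is illustrative):

    trained = NetTrain[memNet, trainData, Method -> "RMSProp", MaxTrainingRounds -> 10]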
Obtain the accuracy of the net on the test data:
Language Modeling

Character-Level Language Model

Train an English character-level language model.
First, create 300,000 training examples of 25 characters each from two novels:
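A sketch, sampling random 25-character windows (the particular novels are illustrative):

    text = ExampleData[{"Text", "AliceInWonderland"}] <> " " <>
       ExampleData[{"Text", "DonQuixoteIEnglish"}];
    trainingData = Table[
       With[{i = RandomInteger[{1, StringLength[text] - 25}]},
        StringTake[text, {i, i + 24}] -> StringTake[text, {i + 25}]],
       {300000}];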
The data takes the form of a classification problem: given a sequence of characters, predict the next one. A sample of the data:
Obtain the list of all characters in the text:
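For instance:

    chars = Union[Characters[text]];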
Define a net that takes in a string of characters and returns a prediction for the next character:
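A sketch with an illustrative state size:

    lmNet = NetChain[{
       GatedRecurrentLayer[128],
       SequenceLastLayer[],
       LinearLayer[Length[chars]],
       SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Characters", chars}],
      "Output" -> NetDecoder[{"Class", chars}]]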
Train the net. This can take up to an hour on a CPU, so use TargetDevice->"GPU" if you have an NVIDIA graphics card available. A modern GPU should be able to complete this example in about 7 minutes:
Predict the next character, given a sequence of characters:
Generate 100 characters of text, given a start text. Note that this method uses NetStateObject to efficiently generate long sequences of text:
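A sketch, assuming charPredictor is a stateful single-character version of the trained net, built from its trained layers as in the sequence-to-sequence example above:

    generateText[start_String, n_Integer] := Module[
      {obj = NetStateObject[charPredictor], next, out = start},
      Scan[(next = obj[#]) &, Characters[start]];  (* prime the state on the seed text *)
      Do[out = out <> next; next = obj[next], {n}];
      out]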
You can get more interesting text by sampling from the probability distribution of predictions:
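For instance, replacing the deterministic step with a weighted random choice over the "Probabilities" property (assuming the state object supports property specifications the way nets do):

    sampleNext[obj_, c_] := With[{p = obj[c, "Probabilities"]},
       RandomChoice[Values[p] -> Keys[p]]]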
An alternative and equivalent formulation of this learning problem, requiring only a single string as input, is to separate the last character from the rest of the sequence inside the net.
Use SequenceMostLayer and SequenceLastLayer in a graph to separate the last character from the rest:
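A sketch of the graph; the one-hot last character produced by the encoder is compared as a probability vector:

    lmNet2 = NetGraph[<|
       "most" -> SequenceMostLayer[],
       "last" -> SequenceLastLayer[],
       "predict" -> NetChain[{GatedRecurrentLayer[128], SequenceLastLayer[],
          LinearLayer[Length[chars]], SoftmaxLayer[]}],
       "loss" -> CrossEntropyLossLayer["Probabilities"]|>,
      {NetPort["Input"] -> "most" -> "predict" -> NetPort["loss", "Input"],
       NetPort["Input"] -> "last" -> NetPort["loss", "Target"]},
      "Input" -> NetEncoder[{"Characters", chars}]]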
Train this net on the input sequences from the original training data (technically, this means you end up training the net on slightly shorter sequences):
A more efficient way to train language models is to use "teacher forcing", in which the net simultaneously predicts the entire sequence, rather than just the last letter.
First, build the net that does prediction of an entire sequence at once. This differs from the previous prediction nets in that the LinearLayer is mapped and a matrix softmax is performed, instead of taking the last element and doing an ordinary vector softmax:
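A sketch; the mapped LinearLayer produces one prediction per position, and SoftmaxLayer normalizes each one:

    predictNet = NetChain[{
       GatedRecurrentLayer[128],
       NetMapOperator[LinearLayer[Length[chars]]],
       SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Characters", chars}]]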
Now build the forcing network, which takes a target sentence and presents it to the network in a "staggered" fashion: for a length-26 sentence, present characters 1 through 25 to the net so that it produces predictions for characters 2 through 26, which are compared with the real characters via the CrossEntropyLossLayer to produce a loss:
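A sketch, stripping predictNet's own encoder so the graph-level encoder feeds both branches:

    forcingNet = NetGraph[<|
       "most" -> SequenceMostLayer[],  (* characters 1 .. n-1 *)
       "rest" -> SequenceRestLayer[],  (* characters 2 .. n *)
       "predict" -> NetReplacePart[predictNet, "Input" -> None],
       "loss" -> CrossEntropyLossLayer["Probabilities"]|>,
      {NetPort["Input"] -> "most" -> "predict" -> NetPort["loss", "Input"],
       NetPort["Input"] -> "rest" -> NetPort["loss", "Target"]},
      "Input" -> NetEncoder[{"Characters", chars}]]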
Train the net on the input sequences from the original data. On a typical CPU, this should take around 15 minutes, compared to around 2 minutes on a modern GPU. As teacher forcing is a more efficient technique, you can afford to use a smaller number of training rounds:
Extract the prediction chain from the results:
Build a single-character prediction chain from it:
Test the predictor:
Generate 100 characters of text, given a start text. Note that this method uses NetStateObject to efficiently generate long sequences of text:
You can get more interesting text by sampling from the probability distribution of predictions: