Sequence Learning and NLP with Neural Networks

Sequence learning refers to a variety of related tasks that neural nets can be trained to perform. What all these tasks have in common is that the input to the net is a sequence of some kind. This input is usually of variable length, meaning that the same net can operate equally well on short or long sequences.

Sequence Regression

Integer Addition

In this example, a net is trained to add two integers of up to two digits each. What makes the problem hard is that the input is a string rather than a pair of numeric values, whereas the output is a numeric value. For example, the net takes as input "25+33" and needs to return a real number close to 58. Note that the input is of variable length, as the training data contains examples of length 3 ("5+5"), length 4 ("5+10") and length 5 ("10+10").

Create training data based on strings that describe two-digit additions and the corresponding numeric result:
In[1]
Out[1]
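Such data might be generated along the following lines (the sample count and exact formatting are illustrative):

(* random sums of integers with up to two digits, paired with their numeric results *)
trainingData = Table[
   With[{a = RandomInteger[{0, 99}], b = RandomInteger[{0, 99}]},
    ToString[a] <> "+" <> ToString[b] -> N[a + b]],
   {10000}];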
Create a net composed of a chain of recurrent layers to read an input string and predict the numeric result:
In[2]
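A minimal sketch of such a chain, assuming a "Characters" encoder over the digits and "+" and illustrative layer sizes:

(* stacked recurrent layers read the characters; the last state is mapped to a single real number *)
net = NetChain[{
    GatedRecurrentLayer[100],
    GatedRecurrentLayer[100],
    SequenceLastLayer[],
    LinearLayer[]},
   "Input" -> NetEncoder[{"Characters", Join[CharacterRange["0", "9"], {"+"}], "UnitVector"}],
   "Output" -> "Real"];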
Train the network:
In[3]
Out[3]
Apply the trained network to a list of inputs:
In[4]
Out[4]

Sequence Classification

Sentiment Analysis

Train a classifier that labels movie review snippets as "positive" or "negative".

First, obtain the training and test data:
In[1]
In[2]
Out[2]
Define a net that takes a string of words as input and returns either "positive" or "negative":
In[3]
Out[3]
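A net of this shape might look like the following sketch, with a token encoder and illustrative dimensions:

(* embed each token, read the sequence with a recurrent layer, and classify the final state *)
net = NetChain[{
    EmbeddingLayer[100],
    GatedRecurrentLayer[100],
    SequenceLastLayer[],
    LinearLayer[2],
    SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Tokens", "English"}],
   "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]];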
Train the net for two training rounds:
In[4]
Out[4]
Evaluate the trained net on an example from the test set, obtaining the probabilities:
In[5]
Out[5]
In[6]
Out[6]

Sequence-to-Sequence Learning

Sequence-to-sequence learning is a task in which both the input and the predicted output are sequences. Translating German to English, transcribing speech from an audio file, and sorting lists of numbers are all examples of such tasks.

Integer Addition with Fixed-Length Output

This example demonstrates how to train nets that take a variable-length sequence as input and predict a fixed-length sequence as output. We take a string that describes a sum, e.g. "250+123", and produce an output string that describes the answer, e.g. "373".

Create training data based on strings that describe three-digit additions and the corresponding result as a string. In order for the output to be fixed length, all outputs are padded to the maximum length of 4 (as the maximum value is 999+999=1998):
In[1]
In[2]
Out[2]
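Such padded data might be generated as follows (left padding with spaces is an assumption; the sample count is illustrative):

(* every target string is padded to length 4, the length of 999+999=1998 *)
trainingData = Table[
   With[{a = RandomInteger[{0, 999}], b = RandomInteger[{0, 999}]},
    ToString[a] <> "+" <> ToString[b] -> StringPadLeft[ToString[a + b], 4]],
   {50000}];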
Define a net, taking a string as input and returning another string of length 4:
In[3]
Out[3]
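One plausible shape for such a net, with illustrative sizes and a space as the padding character:

inChars = Join[CharacterRange["0", "9"], {"+"}];
outChars = Join[CharacterRange["0", "9"], {" "}];
(* read the input into a summary vector, replicate it 4 times, and decode 4 output characters *)
net = NetChain[{
    GatedRecurrentLayer[100],
    SequenceLastLayer[],
    ReplicateLayer[4],
    GatedRecurrentLayer[100],
    NetMapOperator[LinearLayer[Length[outChars]]],
    SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Characters", inChars, "UnitVector"}],
   "Output" -> NetDecoder[{"Characters", outChars}]];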
Train the net for 40 training rounds:
In[4]
Out[4]
Apply the net to some examples:
In[5]
Out[5]
Obtain the accuracy of the net on the test set:
In[6]
Out[6]

Integer Addition with Variable-Length Output

This example demonstrates how to train nets in which both the input and the output are variable-length sequences whose lengths can differ. A sophisticated example of this task is translating English to German; the example covered here is a simpler problem: taking a string that describes a sum, e.g. "250+123", and producing an output string that describes the answer, e.g. "373". The method used is based on I. Sutskever et al., "Sequence to Sequence Learning with Neural Networks", 2014.

Create training data based on strings that describe three-digit additions and the corresponding result as a string:
In[7]
In[8]
Out[8]
Create a NetEncoder that uses a special code for the start and end of a string, which will be used to indicate the beginning and end of the output sequence:
In[9]
Out[9]
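Such an encoder might be written as follows, with the special symbols included as extra classes:

targetEnc = NetEncoder[{"Characters",
    Join[CharacterRange["0", "9"], {StartOfString, EndOfString}],
    "UnitVector"}];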
Evaluate the encoder on an input string:
In[10]
Out[10]
Create a similar encoder for the input, which contains a "+":
In[11]
Out[11]
Define a net that takes a sequence of inputs and returns a single vector of size 150 as output:
In[12]
Out[12]
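A minimal sketch of such an encoder net, assuming the input "Characters" encoder from the previous step is named inputEnc:

(* the final recurrent state serves as a size-150 summary of the input string *)
encoderNet = NetChain[{GatedRecurrentLayer[150], SequenceLastLayer[]},
   "Input" -> inputEnc];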
Define a net that takes an input vector of size 150 and a sequence of vectors as input and returns a sequence of vectors as output:
In[13]
Out[13]
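A sketch of such a decoder net; the port name "Vector" and the class count of 12 (ten digits plus the start and end symbols) follow the setup above:

(* the summary vector seeds the recurrent layer's "State" port; each sequence element yields a probability vector *)
decoderNet = NetGraph[{
    GatedRecurrentLayer[150],
    NetMapOperator[LinearLayer[12]],
    SoftmaxLayer[]},
   {NetPort["Vector"] -> NetPort[1, "State"],
    NetPort["Input"] -> 1,
    1 -> 2 -> 3}];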
Define a net with a CrossEntropyLossLayer and containing the encoder and decoder nets:
In[14]
Out[14]
Train the net (this procedure is often referred to as "teacher forcing"):
In[15]
Out[15]
For three-digit integer addition, there are only 1999 possible outputs (0 through 1998). It is feasible to calculate the loss for each possible output and find the one that minimizes the loss:
In[16]
Predict the output, given a number of inputs and the trained net:
In[17]
Out[17]
The loss for each output, given an input, can also be visualized:
In[18]
Out[18]

A more efficient way of obtaining predictions is to generate the output until the EndOfString virtual character is reached.

First, extract the trained "encoder" and "decoder" subnets from the trained NetGraph, and attach appropriate encoders and decoders:
In[19]
Out[19]
In[20]
Out[20]
Use a SequenceLastLayer to make a version of the decoder that only produces predictions for the last character of the answer, given the previous characters. Here a character decoder using the same alphabet as the "Target" input is attached to the probability vector:
In[21]
Out[21]
Apply the decoder to a partial answer:
In[22]
Out[22]
When the decoder predicts the end of the target string, an empty string will be produced:
In[23]
Out[23]
Now define a prediction function that takes the "encoder" and "decoder" nets and an input string. The function will compute successively longer results until the decoder claims to be finished:
In[24]
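A sketch of such a function; the decoder's behavior of returning "" at the end of the string follows the previous steps, but the port names here are hypothetical:

predictAnswer[encoderNet_, charDecoder_, input_String] :=
 Module[{code = encoderNet[input], result = "", next},
  (* repeatedly ask for the next character until the decoder signals the end *)
  While[
   (next = charDecoder[<|"Vector" -> code, "Partial" -> result|>]) =!= "" &&
    StringLength[result] < 4,
   result = result <> next];
  result]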
Evaluate this prediction function on input data:
In[25]
Out[25]
In[26]
Out[26]
This is a good approximation to the first method, which finds the exact maximum-likelihood answer:
In[27]
Out[27]

The naive technique of generating by passing each partial answer to the decoder net to derive the next character has time complexity O(n^2), where n is the length of the output sequence, and so is not appropriate for generating longer sequences. NetStateObject can be used to generate with time complexity O(n).

First, a decoder is created that takes a single character and predicts the next character. The recurrent state of the GatedRecurrentLayer is handled by a NetStateObject at a later point.

Obtain the trained GatedRecurrentLayer and LinearLayer from the trained net:
In[28]
Out[28]
Define a "Class" encoder and decoder that will encode and decode individual characters, as well as the special classes to indicate the start and end of the string:
In[29]
Define a net that takes a single character, runs one step of the GatedRecurrentLayer, and produces a single softmax prediction:
In[30]
Out[30]
This predictor has an internal recurrent state, as revealed by NetInformation:
In[31]
Out[31]
Create a function that uses NetStateObject to memorize the internal recurrent state, which is seeded from the code produced by the trained encoder:
In[32]
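A sketch of such a function; the exact state-seeding specification is an assumption (the available state positions can be listed with NetInformation[predictor, "RecurrentStatesPositions"]):

generateAnswer[encoderNet_, predictor_, input_String] :=
 Module[{state, next = StartOfString, result = ""},
  (* seed the recurrent state with the summary vector produced by the encoder *)
  state = NetStateObject[predictor,
    <|First[NetInformation[predictor, "RecurrentStatesPositions"]] ->
      encoderNet[input]|>];
  (* feed each predicted character back in until EndOfString is predicted *)
  While[(next = state[next]) =!= EndOfString && StringLength[result] < 4,
   result = result <> next];
  result]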
Apply the function to some inputs to obtain predicted sequences:
In[33]
Out[33]
In[34]
Out[34]
Obtain the accuracy of the trained net on the test set:
In[35]
Out[35]

Integer Sorting

In this example, a net is trained to sort lists of integers. For example, the net takes as input {3,2,4} and needs to return {2,3,4}. This example also demonstrates the use of a SequenceAttentionLayer, which significantly improves the performance of neural nets on many sequence learning tasks.

Generate a test and training set consisting of lists of integers between 1 and 6:
In[36]
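Such data might be generated as follows (list lengths and sample count are illustrative):

trainingData = Table[
   With[{list = RandomInteger[{1, 6}, RandomInteger[{2, 8}]]},
    list -> Sort[list]],
   {10000}];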
Display three random samples drawn from the training set:
In[37]
Out[37]
Define a net that uses a SequenceAttentionLayer to read the input sequence and produce the sorted output:
In[38]
Train the net:
In[39]
Out[39]
Use the net to sort a list of integers:
In[40]
Out[40]

Optical Character Recognition (OCR) on a Toy Dataset

The optical character recognition problem is to take an image containing a sequence of characters and return the list of characters. One simple approach is to preprocess the image to produce images containing only a single character and then perform classification on each. This approach is fragile and fails completely for domains such as cursive handwriting, where the characters run together. The approach taken here instead trains a net on images of whole words, using a connectionist temporal classification (CTC) loss to align the net's per-column predictions with the target character sequence.

First, generate training and test data, which consists of images of words and the corresponding word string:
In[41]
In[42]
Split the dataset into a test set and a training set:
In[43]
Take a RandomSample of the training set:
In[44]
Out[44]
The list of characters used:
In[45]
Out[45]
The decoder is a beam search decoder with a beam size of 50:
In[46]
Out[46]
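Assuming the character list from the previous step is named characters, such a decoder might be written:

decoder = NetDecoder[{"CTCBeamSearch", characters, "BeamSize" -> 50}];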
Define a net that takes an image and then treats the width dimension as a sequence dimension. A sequence of probability vectors over the width dimension is produced:
In[47]
Out[47]
Define a CTCLossLayer with a character NetEncoder attached to the target port:
In[48]
Out[48]
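A sketch of such a loss layer, again assuming the character list is named characters:

(* the encoder converts each target string into the class-index sequence expected by the CTC loss *)
ctcLoss = CTCLossLayer["Target" -> NetEncoder[{"Characters", characters}]];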
Train the net using the CTC loss:
In[49]
Out[49]
Evaluate the trained net on images from the test set:
In[50]
Out[50]
In[51]
Out[51]
Obtain the top 5 decodings for an image, along with the negative log-likelihood of each decoding:
In[52]
Out[52]

Question Answering

Simple RNN Trained on the bAbI QA Dataset

Train a question-answering net on the first task (Single Supporting Fact) of the bAbI QA dataset using a recurrent network.

First, obtain the training and validation data:
In[1]
Out[1]
In[2]
In[3]
Out[3]
Obtain the list of classes and the dictionary used for the questions and contexts:
In[4]
Out[4]
In[5]
Out[5]
Define a net that takes a question string and a context string and returns an answer:
In[6]
Out[6]
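A minimal sketch of such a two-input net, assuming the classes and dictionary obtained above and illustrative dimensions:

(* separate recurrent readers summarize the question and the context; their final states are catenated and classified *)
net = NetGraph[<|
    "question" -> {EmbeddingLayer[50], GatedRecurrentLayer[50], SequenceLastLayer[]},
    "context" -> {EmbeddingLayer[50], GatedRecurrentLayer[50], SequenceLastLayer[]},
    "catenate" -> CatenateLayer[],
    "classify" -> {LinearLayer[Length[classes]], SoftmaxLayer[]}|>,
   {NetPort["Question"] -> "question",
    NetPort["Context"] -> "context",
    {"question", "context"} -> "catenate" -> "classify"},
   "Question" -> NetEncoder[{"Tokens", dictionary}],
   "Context" -> NetEncoder[{"Tokens", dictionary}],
   "Output" -> NetDecoder[{"Class", classes}]];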
Train the network for three training rounds. NetTrain will automatically attach a CrossEntropyLossLayer using the same classes that were provided to the decoder:
In[7]
Out[7]
Make a prediction with the trained net:
In[8]
Out[8]
Obtain the accuracy of the trained net on the test set:
In[9]
Out[9]

Memory Network Trained on the bAbI QA Dataset

Train a question-answering net on the first task (Single Supporting Fact) of the bAbI QA dataset, using a memory network based on Sukhbaatar et al., "End-to-End Memory Networks", 2015.

First, obtain the training and validation data:
In[10]
Out[10]
In[11]
In[12]
Out[12]

The memory net has layers (such as TransposeLayer) that currently do not support variable-length sequences.

Convert all strings to lists of tokens and use left padding (which has better performance than right padding for this example) to ensure these lists are the same length:
In[13]
In[14]
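An illustrative version of this preprocessing, assuming a simple tokenizer that splits on spaces and keeps "." and "?" as tokens:

tokenize[s_String] := StringSplit[s, {" ", "." -> ".", "?" -> "?"}];
(* left-pad each token list to a common length with the inert symbol Padding *)
padTokens[s_String, n_Integer] := PadLeft[tokenize[s], n, Padding];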
The "Context" input is now a list of length 68 padded with the Padding symbol:
In[15]
Out[15]
Obtain the list of classes and the dictionary used for the questions and contexts:
In[16]
Out[16]
In[17]
Out[17]
Define the net:
In[18]
Train the net using the "RMSProp" optimization method, which improves learning performance for this example:
In[19]
Out[19]
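Such a training call might look like the following (the round count is an assumption):

trained = NetTrain[net, trainingData, Method -> "RMSProp", MaxTrainingRounds -> 10];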
Obtain the accuracy of the net on the test data:
In[20]
Out[20]

Language Modeling

Character-Level Language Model

Train an English character-level language model.

First, create 300,000 training examples of 25 characters each from two novels:
In[1]
In[2]
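An illustrative construction of this kind, using a text from ExampleData as the source:

text = ExampleData[{"Text", "AliceInWonderland"}];
(* pair each random 25-character window with the character that follows it *)
trainingData = Table[
   With[{i = RandomInteger[{1, StringLength[text] - 25}]},
    StringTake[text, {i, i + 24}] -> StringTake[text, {i + 25}]],
   {300000}];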
The data takes the form of a classification problem: given a sequence of characters, predict the next one. A sample of the data:
In[3]
Out[3]
Obtain the list of all characters in the text:
In[4]
Out[4]
Define a net that takes in a string of characters and returns a prediction for the next character:
In[5]
Out[5]
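A minimal sketch of such a predictor, assuming the character list from above is named chars and an illustrative layer size:

net = NetChain[{
    GatedRecurrentLayer[128],
    SequenceLastLayer[],
    LinearLayer[Length[chars]],
    SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Characters", chars, "UnitVector"}],
   "Output" -> NetDecoder[{"Class", chars}]];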
Train the net. This can take up to an hour on a CPU, so use TargetDevice->"GPU" if you have an NVIDIA graphics card available. A modern GPU should be able to complete this example in about 7 minutes.
In[6]
Out[6]
In[7]
Out[7]
Predict the next character, given a sequence of characters:
In[8]
Out[8]
Generate 100 characters of text, given a start text. Note that this method uses NetStateObject to efficiently generate long sequences of text:
In[9]
In[10]
Out[10]
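A sketch of such a generation function; NetStateObject carries the recurrent state forward, so after the start text is consumed, each additional character costs a single evaluation:

generateText[net_, start_String, n_Integer] :=
 Module[{state = NetStateObject[net], next, out = start},
  next = state[start];  (* consume the seed text and predict the next character *)
  Do[out = out <> next; next = state[next], {n}];
  out]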
You can get more interesting text by sampling from the probability distribution of predictions:
In[11]
In[12]
Out[12]
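A sketch of such a sampler; this simple version re-feeds the whole string at each step, so it is best suited to short outputs:

(* draw the next character from the predicted distribution instead of taking the most likely one *)
sampleNext[net_, s_String] :=
 With[{probs = net[s, "Probabilities"]},
  RandomChoice[Values[probs] -> Keys[probs]]];
sampleText[net_, start_String, n_Integer] :=
 Nest[# <> sampleNext[net, #] &, start, n]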

An alternative and equivalent formulation of this learning problem, requiring only a single string as input, is to separate the last character from the rest of the sequence inside the net.

Use SequenceMostLayer and SequenceLastLayer in a graph to separate the last character from the rest:
In[13]
Out[13]
Train this net on the input sequences from the original training data (technically, this means you end up training the net on slightly shorter sequences):
In[14]
Out[14]

A more efficient way to train language models is to use "teacher forcing", in which the net predicts every character of the sequence simultaneously, rather than just the last one.

First, build the net that does prediction of an entire sequence at once. This differs from the previous prediction nets in that the LinearLayer is mapped and a matrix softmax is performed, instead of taking the last element and doing an ordinary vector softmax:
In[15]
Out[15]
Now build the forcing network, which takes a target sentence and presents it to the network in a "staggered" fashion: for a length-26 sentence, present characters 1 through 25 to the net so that it produces predictions for characters 2 through 26, which are compared with the real characters via the CrossEntropyLossLayer to produce a loss:
In[16]
Out[16]
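A sketch of such a pair of nets, assuming the character list chars and an index-form encoder; sizes are illustrative:

(* predict a probability vector for every position of the input sequence *)
predictNet = NetChain[{
    UnitVectorLayer[Length[chars]],
    GatedRecurrentLayer[128],
    NetMapOperator[LinearLayer[Length[chars]]],
    SoftmaxLayer[]}];
(* present characters 1..n-1 to the predictor and compare against characters 2..n *)
forcingNet = NetGraph[<|
    "most" -> SequenceMostLayer[],
    "rest" -> SequenceRestLayer[],
    "predict" -> predictNet,
    "loss" -> CrossEntropyLossLayer["Index"]|>,
   {NetPort["Input"] -> "most" -> "predict" -> NetPort["loss", "Input"],
    NetPort["Input"] -> "rest" -> NetPort["loss", "Target"]},
   "Input" -> NetEncoder[{"Characters", chars, "Index"}]];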
Train the net on the input sequences from the original data. On a typical CPU, this should take around 15 minutes, compared to around 2 minutes on a modern GPU. As teacher forcing is a more efficient technique, you can afford to use a smaller number of training rounds:
In[17]
Out[17]
Extract the prediction chain from the results:
In[18]
Out[18]
Build a single-character prediction chain from it:
In[19]
Out[19]
Test the predictor:
In[20]
Out[20]
Generate 100 characters of text, given a start text. Note that this method uses NetStateObject to efficiently generate long sequences of text:
In[21]
Out[21]
You can get more interesting text by sampling from the probability distribution of predictions:
In[22]
Out[22]
