Training on Large Datasets

Neural nets are well-suited for being trained on very large datasets, even those that are too large to fit into memory. The most popular optimization algorithms for training neural nets (such as "ADAM" or "RMSProp" in NetTrain) are variations of an approach called stochastic gradient descent. In this approach, small batches of data are randomly sampled from the full training dataset and used to perform a parameter update. Thus, neural nets are an example of an online learning algorithm, which does not require the entire training dataset to be in memory. This is in contrast to methods such as the Support Vector Machine (SVM) and Random Forest algorithms, which usually require the entire dataset to reside in memory during training.

However, special handling is required if NetTrain is to be used on a dataset that does not fit into memory, as the full training dataset cannot be loaded into a Wolfram Language session. There are two approaches to training on such large datasets. The first approach is for users to write a generator function f that, when evaluated, loads a single batch of data from an external source such as a disk or database. NetTrain[net,f,…] calls f at each training batch iteration, thus keeping only a single batch of training data in memory. This is the most flexible approach to out-of-core learning. However, it is up to the user to ensure that the evaluation of f is fast enough to avoid slowing down net training (which is particularly important when training on fast GPUs) and that it samples correctly from the full training dataset.
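As a rough illustration of the generator approach, the sketch below trains a tiny net from a generator function. The loader loadBatch is a hypothetical stand-in that synthesizes examples; a real loader would read each batch from disk or a database. The net, the round length and the batch size are likewise assumptions chosen only to keep the example small.

```wolfram
(* Hypothetical stand-in for a loader that reads one batch from an external source;
   here it simply synthesizes n examples of the form input -> target *)
loadBatch[n_Integer] := Table[
  With[{x = RandomReal[{-1, 1}, 2]}, x -> {Total[x]}],
  n]

(* NetTrain passes an association to the generator; #BatchSize is the requested batch size *)
generator = Function[loadBatch[#BatchSize]];

net = NetChain[{LinearLayer[1]}, "Input" -> 2];

(* "RoundLength" tells NetTrain how many examples to count as one round *)
trained = NetTrain[net, {generator, "RoundLength" -> 1024},
  BatchSize -> 64, MaxTrainingRounds -> 5]
```

Only the examples returned by a single call to the generator are held in memory at any time.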

The second approach works in the special case of a training dataset consisting of image or audio files. In this case, encoders such as the "Image" NetEncoder can efficiently read and preprocess images stored on disk.

Find a file in ExampleData:
Apply the encoder to the file:
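A minimal sketch of these two steps, assuming the file "ExampleData/ocelot.jpg" is available in the installation and using an encoder size and color space chosen for illustration:

```wolfram
(* Locate an example image on disk and wrap its path in File *)
file = File[FindFile["ExampleData/ocelot.jpg"]];

(* The encoder reads the file, resizes it and converts it to a numeric array *)
enc = NetEncoder[{"Image", {28, 28}, "ColorSpace" -> "Grayscale"}];

Dimensions[enc[file]]
```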

NetTrain[net,{File[…]->class1,…},…] then automatically does out-of-core learning on a dataset in which each File[…] refers to an image or audio file stored on disk.

Some general performance tips for doing out-of-core learning:

Out-of-Core Training on MNIST

This example shows how to train a net on the MNIST dataset when the images are stored on disk as "JPEG" files rather than in memory. Although this is hardly necessary for training on MNIST, the same method can be used to train nets on terabyte-scale image datasets such as the ImageNet dataset.

We are able to use the special syntax NetTrain[net,{File[…]->class1,…},…] as long as we attach an "Image" NetEncoder to the input port of net. We also show how to reproduce this using the more flexible NetTrain generator function syntax.

Preparing the Data

We first need to create an out-of-core version of the MNIST dataset of the form {File[…]->class1,…}.

Obtain the MNIST training and test sets:

Let us save the images as "JPEG" files in the default temporary directory $TemporaryDirectory. Each exported file also requires a unique name. A good approach to creating unique names is to use the Hash function, which returns a hash that is, for practical purposes, unique to each distinct image.

Define a function that exports an image to a "JPEG" file in the $TemporaryDirectory:
Export an image:

When training on images, the net usually requires all images to have a single size, color space, etc. In the case of LeNet, the images are expected to be grayscale and of dimensions 28×28. Training is usually faster if the images read from disk already conform to what the net expects, rather than needing to be conformed every time they are loaded. If the images are not already conformed, modifying exportImage to conform them using ConformImages is recommended.

We can now map the exporting function across the training and test sets. One optimization is to do the exporting in parallel, using ParallelMap rather than Map.

Export the images and create new training and test sets of the form {File[…]->class,…}:
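The data preparation steps can be sketched as follows. This is one possible implementation rather than the original code: the names trainingData, testData, fileTrainingData and fileTestData are hypothetical, and the random subsets are an assumption made only to keep the sketch quick to run.

```wolfram
(* Small random subsets keep this sketch quick; use the full sets in practice *)
trainingData = RandomSample[ResourceData["MNIST", "TrainingData"], 1000];
testData = RandomSample[ResourceData["MNIST", "TestData"], 200];

(* Export an image to a JPEG file named after its hash; Export returns the file path *)
exportImage[img_Image] := Export[
  FileNameJoin[{$TemporaryDirectory, ToString[Hash[img]] <> ".jpg"}],
  img, "JPEG"]

(* Replace each image by a File reference, keeping its class label *)
fileTrainingData = ParallelMap[File[exportImage[First[#]]] -> Last[#] &, trainingData];
fileTestData = ParallelMap[File[exportImage[First[#]]] -> Last[#] &, testData];
```

Because the file name is derived from the hash, identical images map to the same file, which is harmless here since their contents are the same.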
Display a RandomSample of the new training set:

Now only the references to the image files need to be kept in memory, which is much smaller than keeping the images themselves in memory.

Obtain the ByteCount of the two training sets:

Simple Out-of-Core Training

Define a convolutional neural network that has an "Image" NetEncoder attached to the input port:

As this net has an "Image" NetEncoder attached to the input port, it can take images represented as File objects as input.

Obtain a file from the training set:
Initialize the net with NetInitialize and apply it to the file:

Training the net on data of the form {File[…]->class1,…} is exactly the same as training on data of the form {image1->class1,…}.

Train the net for three training rounds:
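A sketch of the net definition and training call, assuming fileTrainingData and fileTestData (hypothetical names) hold lists of the form {File[…]->class,…} as prepared in the previous section. The layer sizes follow the usual LeNet layout and are an assumption, not necessarily the original settings.

```wolfram
(* LeNet-style net; the "Image" encoder lets the input port accept File objects *)
lenet = NetChain[{
    ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
    ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
    FlattenLayer[], 500, Ramp, 10, SoftmaxLayer[]},
  "Input" -> NetEncoder[{"Image", {28, 28}, "ColorSpace" -> "Grayscale"}],
  "Output" -> NetDecoder[{"Class", Range[0, 9]}]];

(* Images are read from disk batch by batch during training *)
trained = NetTrain[lenet, fileTrainingData,
  ValidationSet -> fileTestData, MaxTrainingRounds -> 3];

(* Accuracy on the out-of-core test set *)
NetMeasurements[trained, fileTestData, "Accuracy"]
```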
Evaluate the trained network directly on images randomly sampled from the validation set:
Obtain the accuracy of the trained net on the test set:

Training with a Generator Function

We can also use the more general generator syntax for NetTrain. This approach is more flexible, allowing custom image preprocessing, data augmentation, etc.

Define a generator function that returns batches of training data:
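One way such a generator might look, assuming lenet, fileTrainingData and fileTestData (hypothetical names) are defined as in the previous sections:

```wolfram
(* Sample a batch of File -> class rules; #BatchSize is supplied by NetTrain *)
genTrain = Function[RandomSample[fileTrainingData, #BatchSize]];

(* Call the generator by hand to inspect a batch of 4 examples *)
genTrain[<|"BatchSize" -> 4|>]

(* "RoundLength" sets how many examples make up one training round *)
trainedGen = NetTrain[lenet,
  {genTrain, "RoundLength" -> Length[fileTrainingData]},
  ValidationSet -> fileTestData, MaxTrainingRounds -> 3];
```

Custom preprocessing or augmentation could be applied inside genTrain before the batch is returned.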
Generate a batch of 4 training examples:
Train the net with the generator function:
Obtain the accuracy of the trained net on the test set:

Import Versus NetEncoder Performance

Using a NetEncoder for loading data is usually much faster than writing a data loader using top-level Wolfram Language code. As an example, let us compare a simple image importer to the "Image" NetEncoder.

Define a custom image data loader:
Define an "Image" NetEncoder:

Each of these encoders will produce a rank-4 numeric tensor when applied to a list of input files.

Create a batch of 4 files:
Apply both image encoders to a list of files and obtain the output dimensions and time taken to evaluate:
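The comparison can be sketched as below. The loader customEncoder is a hypothetical top-level implementation, and the batch is built from a single example file (assumed to exist in the installation) repeated four times, so the timings are only indicative.

```wolfram
(* A batch of 4 file references (the same example file repeated) *)
files = ConstantArray[File[FindFile["ExampleData/ocelot.jpg"]], 4];

(* Top-level loader: import, resize, convert, and wrap as a 1x28x28 array per image *)
customEncoder[fs_List] := Map[
  {ImageData[ColorConvert[ImageResize[Import[#], {28, 28}], "Grayscale"]]} &,
  fs]

enc = NetEncoder[{"Image", {28, 28}, "ColorSpace" -> "Grayscale"}];

{tCustom, outCustom} = AbsoluteTiming[customEncoder[files]];
{tEnc, outEnc} = AbsoluteTiming[enc[files]];

(* Both outputs should be rank-4 arrays *)
{Dimensions[outCustom], Dimensions[outEnc]}

(* Speedup factor; the value depends on the machine and the batch size *)
tCustom/tEnc
```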
Obtain the speedup of the "Image" NetEncoder compared to the custom image encoder:

In this comparison, the "Image" NetEncoder is more than 100 times faster than the custom image encoder. The difference is even greater if the batch size is increased.

Using a MongoDB Database

A MongoDB database is one solution for storing large datasets. The Wolfram Language has the MongoLink package for interacting with MongoDB databases.

This example shows how to train a net on the toy Fisher Iris dataset stored in a MongoDB database. Only small batches of data will be randomly sampled from the database during each training iteration. Thus, this method scales to datasets that cannot be stored in memory.

We will assume you are familiar with MongoLink and MongoDB databases. If not, reading the MongoLink Introduction tutorial is recommended before continuing.

It is also assumed that a MongoDB server is running on your local machine at the default host and port. For platform-dependent instructions for running a MongoDB server locally, see the MongoDB installation documentation.

Data Insertion

The dataset we are using is the Fisher Iris dataset, where the task is to classify a flower into one of three classes based on some numerical features.

Obtain the Fisher Iris dataset:

Now we insert the training data into a MongoDB database.

Load MongoLink:
Create a client connection using the default host "localhost" and port 27017 (this is the default hostname and port when running the MongoDB server on your local machine):

Let us create a MongoDB collection named "WolframNNTutorialCollection" in the database "WolframNNTutorialDatabase".

Create a collection using MongoGetCollection:
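A sketch of the connection steps, assuming a MongoDB server is running locally on the default host and port. The variable names client and collection are hypothetical.

```wolfram
Needs["MongoLink`"]

(* With no arguments, MongoConnect uses the default host "localhost" and port 27017 *)
client = MongoConnect[];

(* Reference the collection inside the tutorial database *)
collection = MongoGetCollection[client,
  "WolframNNTutorialDatabase", "WolframNNTutorialCollection"]
```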

If the collection and database do not yet exist, they will be created when we first insert data into them.

Convert the training data into a list of MongoDB documents:
Insert the training data into the collection:
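The conversion and insertion might look like the following, assuming collection (a hypothetical name) refers to the MongoDB collection created above. The field names "Input" and "Output" are an assumption chosen to match the port names of a simple net.

```wolfram
trainingData = ExampleData[{"MachineLearning", "FisherIris"}, "TrainingData"];

(* Turn each rule features -> label into a document with two fields *)
documents = Map[<|"Input" -> First[#], "Output" -> Last[#]|> &, trainingData];

(* Insert all documents into the collection *)
MongoCollectionInsert[collection, documents]
```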
Take a random sample of the collection to verify that the data was inserted:

Build a Classification Net

Now let us build a net that can perform classification on the dataset. We first require the list of all possible classes, which is best generated by a database query.

Build a list of the unique labels to which each example is assigned using MongoCollectionDistinct:
Create a NetChain to perform the classification, using a "Class" decoder to interpret the output of the net as probabilities for each class:
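A minimal sketch of these two steps, assuming collection (a hypothetical name) refers to the MongoDB collection and that the labels are stored under a field named "Output". The single linear layer is an assumption made to keep the net small.

```wolfram
(* Ask the database for the distinct labels rather than scanning the data in memory *)
classes = MongoCollectionDistinct[collection, "Output"];

(* Four numerical features in, one probability per class out *)
net = NetChain[{LinearLayer[Length[classes]], SoftmaxLayer[]},
  "Input" -> 4,
  "Output" -> NetDecoder[{"Class", classes}]]
```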

Construct a Generator Function

We need to define a generator function to use in NetTrain. This function needs to randomly sample documents from the "WolframNNTutorialCollection".

Use MongoCollectionAggregate with the "$sample" aggregation operator to obtain two random samples from the "WolframNNTutorialCollection" collection:
This is equivalent to using RandomSample on the collection:

However, we do not need the "_id" field. We can remove it either by modifying the Wolfram Language output or, more efficiently, by adding the "$project" operator to the aggregation pipeline.

Read from the collection with the "_id" field removed:
Define a generator function for use in NetTrain:
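One possible generator, assuming collection (a hypothetical name) refers to the MongoDB collection and that MongoCollectionAggregate returns a cursor that MongoCursorToArray can read into a list of associations. The helper sampleDocuments is hypothetical.

```wolfram
(* Randomly sample n documents and drop the "_id" field on the database side *)
sampleDocuments[n_Integer] := MongoCursorToArray[
  MongoCollectionAggregate[collection, {
    <|"$sample" -> <|"size" -> n|>|>,
    <|"$project" -> <|"_id" -> False|>|>}]]

(* NetTrain supplies the batch size through #BatchSize *)
generator = Function[sampleDocuments[#BatchSize]];

(* Call the generator by hand to inspect a batch of two examples *)
generator[<|"BatchSize" -> 2|>]
```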
This generator function correctly returns a randomly sampled list of examples when the batch size is specified:

There are two main valid forms of training data that generator functions can produce: either a list of example associations {<|key1->val11,key2->val21,…|>,<|key1->val12,…|>,…}, or a single Association where each key has a list of example values <|key1->{val11,val12,…},key2->{val21,val22,…},…|>. One form of the generator function output out can be converted to the other via Transpose[out,AllowedHeads->All].
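To make the two layouts concrete, the sketch below converts a small hand-written batch from the list form to the grouped form and back. It uses Merge and AssociationThread as an alternative illustration of the conversion; the example values are made up.

```wolfram
(* First form: a list of example associations *)
listForm = {
   <|"Input" -> {5.1, 3.5}, "Output" -> "setosa"|>,
   <|"Input" -> {6.2, 2.9}, "Output" -> "versicolor"|>};

(* Second form: one association whose values are lists over the batch *)
assocForm = Merge[listForm, Identity]
(* <|"Input" -> {{5.1, 3.5}, {6.2, 2.9}}, "Output" -> {"setosa", "versicolor"}|> *)

(* And back again: regroup the values example by example *)
backToList = Map[AssociationThread[Keys[assocForm], #] &,
  Transpose[Values[assocForm]]];

backToList === listForm
```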

MongoDB can directly produce the second form of grouping examples together using the $group aggregation stage. This is often much more efficient than producing the first form.

Define a generator function that returns batches of examples grouped together:
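A sketch of a grouped generator, under the same assumptions as before: collection is a hypothetical name for the MongoDB collection, the documents have "Input" and "Output" fields, and the aggregation returns a cursor. The "$group" stage collects all sampled documents into a single document whose fields are lists.

```wolfram
(* Sample on the server, then push every field into one list per key *)
groupedGenerator = Function[First[MongoCursorToArray[
    MongoCollectionAggregate[collection, {
      <|"$sample" -> <|"size" -> #BatchSize|>|>,
      <|"$group" -> <|"_id" -> Null,
         "Input" -> <|"$push" -> "$Input"|>,
         "Output" -> <|"$push" -> "$Output"|>|>|>,
      <|"$project" -> <|"_id" -> False|>|>}]]]];

(* Returns a single association of the form <|"Input" -> {…}, "Output" -> {…}|> *)
groupedGenerator[<|"BatchSize" -> 2|>]
```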
Generate a batch of two examples using the second generator:
The grouped generator is faster at producing batches of size 64:
Obtain the samples generated per second by the second generator:

Train the Net

Train the net with the generator function:

There is an issue with this approach: the performance on the test set is computed after every batch is trained on, compared to the usual case where it is computed after a single pass through the entire training dataset (a round). When using a generator, NetTrain has no idea what the size of a round should be, unless we specify it explicitly.

Obtain the total number of examples in the collection using MongoCollectionCount:
Train the net with a specified round size for 2000 rounds:
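A sketch of training with an explicit round length, assuming net is the classification net, groupedGenerator is a generator function and collection is the MongoDB collection (all hypothetical names). The batch size is an assumption.

```wolfram
(* Use the number of stored documents as the length of one round *)
roundLength = MongoCollectionCount[collection];

testData = ExampleData[{"MachineLearning", "FisherIris"}, "TestData"];

(* With "RoundLength" set, validation happens once per round instead of once per batch *)
trained = NetTrain[net,
  {groupedGenerator, "RoundLength" -> roundLength},
  ValidationSet -> testData,
  BatchSize -> 64, MaxTrainingRounds -> 2000];
```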
Produce a ClassifierMeasurementsObject to test the accuracy of the trained net on the test set:
