Training on Large Datasets

Neural nets are well suited to training on very large datasets, even those that are too large to fit into memory. The most popular optimization algorithms for training neural nets (such as "ADAM" or "RMSProp" in NetTrain) are variations of an approach called stochastic gradient descent, in which small batches of data are randomly sampled from the full training dataset and used to perform a parameter update. Neural net training is thus a form of online learning, which does not require the entire training dataset to be in memory. This is in contrast to methods such as the Support Vector Machine (SVM) and Random Forest algorithms, which usually require the entire dataset to reside in memory during training.
However, special handling is required if NetTrain is to be used on a dataset that does not fit into memory, as the full training dataset cannot be loaded into a Wolfram Language session. There are two approaches to training on such large datasets. The first approach is for users to write a generator function f that, when evaluated, loads a single batch of data from an external source such as a disk or database. NetTrain[net,f,…] calls f at each training batch iteration, thus keeping only a single batch of training data in memory. This is the most flexible approach to doing out-of-core learning. However, it is up to the user to ensure that the evaluation of f is fast enough to avoid slowing down net training (which is particularly important when training on powerful GPUs) and that it correctly samples from the full training dataset.
The second approach works in the special case of a training dataset consisting of image or audio files. In this case, encoders such as the "Image" NetEncoder can efficiently read and preprocess images stored on disk.
Find a file in ExampleData:
Apply the encoder to the file:
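For illustration, a minimal sketch of these two steps, assuming the bundled example image "ExampleData/rose.gif" and an arbitrary 64×64 encoder size:

    (* locate an image file that ships with the Wolfram Language *)
    file = FindFile["ExampleData/rose.gif"];
    (* an "Image" encoder reads, resizes and converts the file to a numeric array *)
    enc = NetEncoder[{"Image", {64, 64}}];
    enc[File[file]] // Dimensions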
NetTrain[net,{File[…]->class1,…},…] then automatically performs out-of-core learning on such a dataset, where each File[…] refers to an image or audio file stored on disk.
Some general performance tips for doing out-of-core learning are given throughout the examples that follow.
Out-of-Core Training on MNIST
This example shows how to train a net on the MNIST dataset when the images are stored on disk as "JPEG" files rather than in memory. Although this is hardly necessary for MNIST, the same method can be used to train nets on terabyte-scale image datasets such as ImageNet.
We are able to use the special syntax NetTrain[net,{File[…]->class1,…},…] as long as we attach an "Image" NetEncoder to the input port of net. We also show how to reproduce this using the more flexible NetTrain generator function syntax.

Preparing the Data

We first need to create an out-of-core version of the MNIST dataset that is of the form {File[…]->class1,…}.
Obtain the MNIST training and test sets:
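One way to obtain the data, assuming the "MNIST" item in the Wolfram Data Repository is available (the names trainingData and testData are used throughout this example):

    trainingData = ResourceData["MNIST", "TrainingData"];
    testData = ResourceData["MNIST", "TestData"];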
Let us save the images as "JPEG" files in the default temporary directory $TemporaryDirectory. Each exported file also requires a unique name. A good approach to creating unique names is to use the Hash function, which returns an integer that is, for practical purposes, unique for each distinct image.
Define a function that exports an image to a "JPEG" file in the $TemporaryDirectory:
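A minimal sketch of such a function; the name exportImage and the use of a hexadecimal hash as the file name are choices made here, not requirements:

    (* export img as a JPEG named after its hash; returns the path of the written file *)
    exportImage[img_Image] := Export[
       FileNameJoin[{$TemporaryDirectory, IntegerString[Hash[img], 16] <> ".jpg"}],
       img, "JPEG"]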
Export an image:
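For example, exporting the first training image (assuming trainingData as obtained above):

    exportImage[First[Keys[trainingData]]]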
When training on images, it is usually the case that the net requires the images to have a single size, color space, etc. In the case of LeNet, the images are expected to be grayscale with dimensions 28×28. Training is usually faster if the images read from disk already conform to what the net expects, rather than being conformed every time they are loaded. If the images are not already conformed, modifying exportImage to conform them using ConformImages is recommended.
We can now map the exporting function across the training and test sets. One optimization is to do the exporting in parallel, using ParallelMap rather than Map.
Export the images and create new training and test sets of the form {File[…]->class,…}:
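A sketch using ParallelMap, assuming the trainingData, testData and exportImage definitions above:

    trainingFiles = ParallelMap[File[exportImage[First[#]]] -> Last[#] &, trainingData];
    testFiles = ParallelMap[File[exportImage[First[#]]] -> Last[#] &, testData];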
Display a RandomSample of the new training set:
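For example (the file names will differ on your machine):

    RandomSample[trainingFiles, 3]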
Now only the references to the image files need to be kept in memory, which is much smaller than keeping the images themselves in memory.
Obtain the ByteCount of the two training sets:
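Comparing the in-memory and file-based training sets:

    {ByteCount[trainingData], ByteCount[trainingFiles]}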

Simple Out-of-Core Training

Define a convolutional neural network that has an "Image" NetEncoder attached to the input port:
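One possible LeNet-style architecture; the exact layer sizes here are illustrative:

    lenet = NetChain[{
        ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
        ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
        FlattenLayer[], 500, Ramp, 10, SoftmaxLayer[]},
       "Input" -> NetEncoder[{"Image", {28, 28}, "ColorSpace" -> "Grayscale"}],
       "Output" -> NetDecoder[{"Class", Range[0, 9]}]]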
As this net has an "Image" NetEncoder attached to the input port, it can take images represented as File objects as input.
Obtain a file from the training set:
Initialize the net with NetInitialize and apply it to the file:
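A sketch using the first file of the file-based training set defined earlier:

    file = First[Keys[trainingFiles]];
    net0 = NetInitialize[lenet];
    net0[file]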
Training the net on data of the form {File[…]->class1,…} is exactly the same as training on data of the form {image1->class1,…}.
Train the net for three training rounds:
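Assuming the lenet, trainingFiles and testFiles definitions above:

    trained = NetTrain[lenet, trainingFiles,
       ValidationSet -> testFiles, MaxTrainingRounds -> 3]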
Evaluate the trained network directly on images randomly sampled from the validation set:
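For example:

    trained[RandomSample[Keys[testFiles], 5]]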
Obtain the accuracy of the trained net on the test set:
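One way to measure the accuracy, using NetMeasurements (available in recent Wolfram Language versions):

    NetMeasurements[trained, testFiles, "Accuracy"]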

Training with a Generator Function

We can also use the more general generator syntax for NetTrain. This approach is more flexible, allowing custom image preprocessing, data augmentation, etc.
Define a generator function that returns batches of training data:
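A minimal sketch of a generator; NetTrain calls it with an association that includes the requested "BatchSize". Any custom preprocessing or augmentation would go inside this function:

    (* return a random batch of File -> class rules of the requested size *)
    generator[config_] := RandomSample[trainingFiles, config["BatchSize"]]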
Generate a batch of 4 training examples:
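For example:

    generator[<|"BatchSize" -> 4|>]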
Train the net with the generator function:
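A sketch, specifying the round length so that one round corresponds to one pass over the data:

    trainedGen = NetTrain[lenet,
       {generator, "RoundLength" -> Length[trainingFiles]},
       MaxTrainingRounds -> 3]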
Obtain the accuracy of the trained net on the test set:
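As before:

    NetMeasurements[trainedGen, testFiles, "Accuracy"]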

Import Versus NetEncoder Performance

Using a NetEncoder for loading data is usually much faster than writing a data loader using top-level Wolfram Language code. As an example, let us compare a simple image importer to the "Image" NetEncoder.
Define a custom image data loader:
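A sketch of a top-level loader that imports, resizes and converts each file, returning channel-first arrays; the name customLoader is chosen here for illustration:

    customLoader[files_List] := Map[
       ImageData[
          ColorConvert[ImageResize[Import[#], {28, 28}], "Grayscale"],
          Interleaving -> False] &,
       files]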
Define an "Image" NetEncoder:
Each of these encoders will produce a rank-4 array when applied to a list of input files.
Create a batch of 4 files:
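For example, taking files from the file-based training set created earlier:

    files = Keys[RandomSample[trainingFiles, 4]]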
Apply both image encoders to a list of files and obtain the output dimensions and time taken to evaluate:
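A sketch of the comparison (the timings will vary by machine):

    {customTime, customOut} = AbsoluteTiming[customLoader[files]];
    {encoderTime, encoderOut} = AbsoluteTiming[encoder[files]];
    {Dimensions[customOut], Dimensions[encoderOut]}
    {customTime, encoderTime}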
Obtain the speedup of the "Image" NetEncoder compared to the custom image encoder:
As can be seen, the "Image" NetEncoder is more than 400 times faster than a custom image encoder! The difference is even greater if the batch size is increased.
Using a MongoDB Database
A MongoDB database is one solution for storing large datasets. The Wolfram Language has the MongoLink package for interacting with MongoDB databases.
This example shows how to train a net on the toy Fisher Iris dataset stored in a MongoDB database. Only small batches of data will be randomly sampled from the database during each training iteration. Thus, this method scales to datasets that cannot be stored in memory.
We will assume you are familiar with MongoLink and MongoDB databases. If not, reading the MongoLink Introduction tutorial is recommended before continuing.
It is also assumed that a MongoDB server is running on your local machine at the default host and port. For platform-specific instructions on running a MongoDB server locally, see the MongoDB documentation.

Data Insertion

The dataset we are using is the Fisher Iris dataset, where the task is to classify a flower into one of three species based on four numerical measurements.
Obtain the Fisher Iris dataset:
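One way, assuming the "FisherIris" machine learning example data is available (the names irisTrain and irisTest are used below):

    irisTrain = ExampleData[{"MachineLearning", "FisherIris"}, "TrainingData"];
    irisTest = ExampleData[{"MachineLearning", "FisherIris"}, "TestData"];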
Now we insert the training data into a MongoDB database.
Load MongoLink:
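The package is loaded with Needs:

    Needs["MongoLink`"]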
Create a client connection using the default host "localhost" and port 27017 (this is the default hostname and port when running the MongoDB server on your local machine):
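A sketch, assuming a local server on the default host and port:

    client = MongoConnect[]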
Let us create a MongoDB collection named "WolframNNTutorialCollection" in the database "WolframNNTutorialDatabase".
Create a collection using MongoGetCollection:
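A sketch:

    collection = MongoGetCollection[client,
       "WolframNNTutorialDatabase", "WolframNNTutorialCollection"]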
If the collection and database do not yet exist, they will be created when we first insert data into them.
Convert the training data into a list of MongoDB documents:
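A sketch in which each example is stored under the keys "Input" and "Output", matching the port names of the net that will be trained; this key choice is an assumption of this example, not a requirement of MongoDB:

    (* each document holds the four measurements and the species label *)
    documents = Map[<|"Input" -> First[#], "Output" -> Last[#]|> &, irisTrain];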
Insert the training data into the collection:
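Assuming MongoCollectionInsert accepts a list of documents in a single call:

    MongoCollectionInsert[collection, documents]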
Take a random sample of the collection to verify that the data was inserted:

Build a Classification Net

Now let us build a net that can perform classification on the dataset. We first require the list of all possible classes, which is best generated by a database query.
Build a list of the unique labels to which each example is assigned using MongoCollectionDistinct:
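A sketch, using the "Output" field chosen when the documents were created:

    classes = MongoCollectionDistinct[collection, "Output"]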
Create a NetChain to perform the classification, using a "Class" decoder to interpret the output of the net as probabilities for each class:
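One possible small network; the hidden layer size is arbitrary:

    irisNet = NetChain[{LinearLayer[64], Ramp,
        LinearLayer[Length[classes]], SoftmaxLayer[]},
       "Input" -> 4,
       "Output" -> NetDecoder[{"Class", classes}]]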

Construct a Generator Function

We need to define a generator function to use in NetTrain. This function needs to randomly sample documents from the "WolframNNTutorialCollection".
Use MongoCollectionAggregate with the "$sample" aggregation operator to obtain two random samples from the "WolframNNTutorialCollection" collection:
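A sketch of the aggregation call:

    MongoCollectionAggregate[collection,
       {<|"$sample" -> <|"size" -> 2|>|>}]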
This is equivalent to using RandomSample on the collection:
However, we do not need the "_id" field. We can remove it either by modifying the Wolfram Language output or, more efficiently, by adding the "$project" operator to the aggregation pipeline.
Read from the collection with the "_id" field removed:
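Adding a "$project" stage to the pipeline:

    MongoCollectionAggregate[collection,
       {<|"$sample" -> <|"size" -> 2|>|>,
        <|"$project" -> <|"_id" -> 0|>|>}]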
Define a generator function for use in NetTrain:
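A sketch, assuming the aggregation result comes back as a list of example associations:

    readRandomBatch[config_] := MongoCollectionAggregate[collection,
       {<|"$sample" -> <|"size" -> config["BatchSize"]|>|>,
        <|"$project" -> <|"_id" -> 0|>|>}]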
This generator function correctly returns a randomly sampled list of examples when the batch size is specified:
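For example:

    readRandomBatch[<|"BatchSize" -> 2|>]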
There are two main valid forms of training data that generator functions can produce: either a list of example associations {<|key1->val11,key2->val21,…|>,<|key1->val12,…|>,…}, or a single association in which each key maps to a list of example values, <|key1->{val11,val12,…},key2->{val21,val22,…},…|>. One form of the generator function output out can be converted to the other via Transpose[out,AllowedHeads->All].
MongoDB can directly produce the second form of grouping examples together using the $group aggregation stage. This is often much more efficient than producing the first form.
Define a generator function that returns batches of examples grouped together:
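A sketch, again assuming the "Input"/"Output" field names; the "$group" stage pushes all sampled values into two lists, producing the second (grouped) form described above:

    readGroupedBatch[config_] := First @ MongoCollectionAggregate[collection,
       {<|"$sample" -> <|"size" -> config["BatchSize"]|>|>,
        <|"$group" -> <|"_id" -> Null,
           "Input" -> <|"$push" -> "$Input"|>,
           "Output" -> <|"$push" -> "$Output"|>|>|>,
        <|"$project" -> <|"_id" -> 0|>|>}]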
Generate a batch of two examples using the second generator:
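For example:

    readGroupedBatch[<|"BatchSize" -> 2|>]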
The grouped generator is faster at producing batches of size 64:
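One way to compare, using RepeatedTiming (the results depend on the machine and the database):

    First @ RepeatedTiming[readRandomBatch[<|"BatchSize" -> 64|>];]
    First @ RepeatedTiming[readGroupedBatch[<|"BatchSize" -> 64|>];]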
Obtain the samples generated per second by the second generator:

Train the Net

Train the net with the generator function:
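A sketch, using the grouped generator and the held-out irisTest data for validation:

    trained = NetTrain[irisNet, readGroupedBatch, ValidationSet -> irisTest]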
There is an issue with this approach: the performance on the test set is computed after every training batch, rather than after a single pass through the entire training dataset (a round), as would normally happen. When using a generator, NetTrain cannot know how large a round should be unless we specify it explicitly.
Obtain the total number of examples in the collection using MongoCollectionCount:
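For example:

    numExamples = MongoCollectionCount[collection]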
Train the net with a specified round size for 2000 rounds:
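A sketch:

    trained = NetTrain[irisNet,
       {readGroupedBatch, "RoundLength" -> numExamples},
       ValidationSet -> irisTest, MaxTrainingRounds -> 2000]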
Test the accuracy of the trained net on the test set:
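As before, using NetMeasurements:

    NetMeasurements[trained, irisTest, "Accuracy"]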