Training on Large Datasets
Neural nets are well suited to training on very large datasets, even those that are too large to fit into memory. The most popular optimization algorithms for training neural nets (such as "ADAM" or "RMSProp" in NetTrain) are variations of an approach called stochastic gradient descent. In this approach, small batches of data are randomly sampled from the full training dataset and used to perform a parameter update. Neural nets are therefore an example of an online learning algorithm, which does not require the entire training dataset to be in memory. This is in contrast to methods such as Support Vector Machines (SVMs) and Random Forests, which usually require the entire dataset to reside in memory during training.
However, special handling is required if NetTrain is to be used on a dataset that does not fit into memory, as the full training dataset cannot be loaded into a Wolfram Language session. There are two approaches to training on such large datasets. The first approach is for users to write a generator function f that, when evaluated, can load a single batch of data from an external source such as a disk or database. NetTrain[net,f,…] calls f at each training batch iteration, thus only keeping a single batch of training data in memory. This is the most flexible approach to doing out-of-core learning. However, it is up to the user to ensure that the evaluation of f is fast enough to avoid slowing down net training (which is particularly important when training on extremely powerful GPUs) and that it is correctly sampling from the full training dataset.
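As a rough sketch of this first approach (the names generator and fileIndex are hypothetical; fileIndex stands for a lightweight in-memory index of File objects pointing to data on disk):

    (* return one randomly sampled batch of File[...] -> label rules *)
    generator[config_Association] := RandomSample[fileIndex, config["BatchSize"]]

    (* NetTrain calls the generator once per batch iteration *)
    NetTrain[net, generator, BatchSize -> 64, MaxTrainingRounds -> 10]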
The second approach works in the special case of a training dataset consisting of image or audio files. In this case, encoders such as the "Image" NetEncoder can efficiently read and preprocess images stored on disk.
Find a file in ExampleData:
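The original code is not reproduced here; as a stand-in, the following sketch exports a test image from ExampleData to a file on disk (the choice of image and file name is arbitrary):

    img = ExampleData[{"TestImage", "Mandrill"}];
    file = File @ Export[FileNameJoin[{$TemporaryDirectory, "mandrill.jpg"}], img]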
NetTrain[net,{File[…]->…,…},…] then automatically does out-of-core learning on such a dataset, where each File[…] refers to an image or audio file stored on disk.
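For instance, an "Image" NetEncoder can read such a file directly, and NetTrain accepts rules whose left-hand sides are File objects (the net, paths and labels below are placeholders):

    enc = NetEncoder[{"Image", {28, 28}, ColorSpace -> "Grayscale"}];
    enc[file]  (* reads and conforms the image straight from disk *)

    (* NetTrain[net, {File["img1.jpg"] -> "cat", File["img2.jpg"] -> "dog", ...}] *)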
- Where possible, use NetEncoder, whose implementation has been highly optimized.
- NetTrain loads the next training data batch in parallel with performing a training iteration on the previous batch. Thus, data loading and preprocessing only affect training speed if they take longer than a single training iteration. This is highly dependent on the complexity of the net and the hardware the net is being trained on.
- When using a generator function f, ensure it is fast enough to avoid slowing down training. To do this, use the generator function to create a copy of the training data that will fit into memory using e.g. dataset=f[<|"BatchSize"->1000|>], and measure the speed at which NetTrain operates on this data using NetTrain[net,dataset,"MeanInputsPerSecond",TimeGoal->30]. Then compare this to the speed when using the data generator, using NetTrain[net,f,"MeanInputsPerSecond",TimeGoal->30] (see the sketch after this list). If the generator is slower, then either optimize f or accept this slowdown as the price of doing out-of-core training.
- Solid State Drives (SSDs) typically give much better performance than Hard Disk Drives (HDDs) when reading a random batch of files from disk. This is due to the random seek time being around an order of magnitude faster on SSDs versus HDDs.
- If the audio sampling rate of audio files or the image dimensions of image files are significantly different from what the net is being trained on, significant training slowdowns can occur, as the expensive resampling operation is repeated every time a file is loaded rather than being done only once. The best approach is to create a new set of resampled image or audio files and train on these.
- Use an NVIDIA GPU for training. This is done using the TargetDevice option in NetTrain. Training on a GPU often speeds up training by an order of magnitude compared to only using a CPU. The reason for this is that net evaluations are typically much faster on GPUs. In addition, data loading and preprocessing can be faster, as the CPU can focus completely on this task.
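The speed comparison described in the list above might look like the following sketch (net and generator are placeholders):

    (* materialize one large batch in memory via the generator *)
    dataset = generator[<|"BatchSize" -> 1000|>];

    (* throughput on in-memory data vs. throughput when calling the generator per batch *)
    NetTrain[net, dataset, "MeanInputsPerSecond", TimeGoal -> 30]
    NetTrain[net, generator, "MeanInputsPerSecond", TimeGoal -> 30]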
This example shows how to train a net on the MNIST dataset when the images are stored on disk as "JPEG" files rather than stored in memory. Although this is hardly necessary for training on MNIST, the same method can be used to train nets on terabyte-scale image datasets such as ImageNet.
We are able to use the special syntax NetTrain[net,{File[…]->class1,…},…] as long as we attach an "Image" NetEncoder to the input port of net. We also show how to reproduce this using the more flexible NetTrain generator function syntax.
Preparing the Data
We first need to create an out-of-core version of the MNIST dataset that is of the form {File[…]->class1,…}.
Let us save the images as "JPEG" files in the default temporary directory $TemporaryDirectory. Each exported file also requires a unique name. A good approach to creating unique names is to use the Hash function, which returns a unique hash for every unique image.
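A sketch of such an exporting function (named exportImage to match the text; the base-36 hash string is just one way to build a unique file name):

    (* export an image to $TemporaryDirectory under a hash-based name and return a File object *)
    exportImage[img_Image] :=
      File @ Export[
        FileNameJoin[{$TemporaryDirectory, IntegerString[Hash[img], 36] <> ".jpg"}],
        img]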
When training on images, it is usually the case that the net requires the images to be a single size, color space, etc. In the case of LeNet, the images are expected to be grayscale and of dimensions 28×28. It will usually increase the speed of training if the images that will be read from disk are already conformed to what the net expects, rather than needing to conform the images every time they are loaded from disk. If images are not already conformed, modifying exportImage to conform the images using ConformImages is recommended.
We can now map the exporting function across the training and test sets. One optimization is to do the exporting in parallel, using ParallelMap rather than Map.
Export the images and create new training and test sets of the form {File[…]->class,…}:
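A sketch of this step, assuming the in-memory MNIST data is available from ExampleData as lists of image -> label rules (the original may load it differently):

    trainingData = ExampleData[{"MachineLearning", "MNIST"}, "TrainingData"];
    testData = ExampleData[{"MachineLearning", "MNIST"}, "TestData"];

    (* export each image and keep only a File -> label rule *)
    trainingFiles = ParallelMap[exportImage[First[#]] -> Last[#] &, trainingData];
    testFiles = ParallelMap[exportImage[First[#]] -> Last[#] &, testData];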
Display a RandomSample of the new training set:
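For example, using the names from the sketch above:

    RandomSample[trainingFiles, 3]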
Now only the references to the image files need to be kept in memory, which is much smaller than keeping the images themselves in memory.
Obtain the ByteCount of the two training sets:
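For example (again using the names from the sketches above):

    {ByteCount[trainingData], ByteCount[trainingFiles]}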
Simple Out-of-Core Training
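The net used in this section is LeNet; a minimal sketch of a LeNet-style chain with an "Image" encoder attached to its input (the layer sizes are the usual LeNet choices, not taken verbatim from the original):

    lenet = NetChain[{
        ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
        ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
        FlattenLayer[], LinearLayer[500], Ramp, LinearLayer[10], SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Image", {28, 28}, ColorSpace -> "Grayscale"}],
      "Output" -> NetDecoder[{"Class", Range[0, 9]}]]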
As this net has an "Image" NetEncoder attached to the input port, it can take images represented as File objects as input.
Initialize the net with NetInitialize and apply it to the file:
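A sketch, using the names introduced above:

    net = NetInitialize[lenet];
    net[trainingFiles[[1, 1]]]  (* the first File object in the training set; returns a predicted class *)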
Training the net on data of the form {File[…]->class1,…} is exactly the same as training on data of the form {image1->class1,…}.
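For example (the option values are illustrative):

    trained = NetTrain[net, trainingFiles, ValidationSet -> testFiles, MaxTrainingRounds -> 3]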
Training with a Generator Function
We can also use the more general generator syntax for NetTrain. This approach is more flexible, allowing custom image preprocessing, data augmentation, etc.
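A sketch of such a generator: it randomly samples File -> label rules and imports the images itself, which is the point at which custom preprocessing or augmentation could be applied (mnistGenerator is a hypothetical name):

    mnistGenerator[config_Association] :=
      Map[
        Import[First[#]] -> Last[#] &,  (* custom preprocessing/augmentation could be applied to the imported image here *)
        RandomSample[trainingFiles, config["BatchSize"]]]

    (* NetTrain[net, mnistGenerator, MaxTrainingRounds -> 3] *)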
Import Versus NetEncoder Performance
Using a NetEncoder for loading data is usually much faster than writing a data loader using top-level Wolfram Language code. As an example, let us compare a simple image importer to the "Image" NetEncoder.
Apply both image encoders to a list of files and obtain the output dimensions and time taken to evaluate:
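A sketch of the comparison; the custom encoder shown is one simple way such an importer might be written, and files is assumed to be a list of the exported image files:

    files = Keys[trainingFiles][[;; 100]];

    (* a naive top-level importer producing 1x28x28 arrays *)
    customEncoder[fs_List] :=
      Map[{ImageData @ ColorConvert[ImageResize[Import[#], {28, 28}], "Grayscale"]} &, fs];
    imageEncoder = NetEncoder[{"Image", {28, 28}, ColorSpace -> "Grayscale"}];

    AbsoluteTiming[Dimensions @ customEncoder[files]]
    AbsoluteTiming[Dimensions @ imageEncoder[files]]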
As can be seen, the "Image" NetEncoder is more than 400 times faster than a custom image encoder! The difference is even greater if the batch size is increased.
A MongoDB database is one solution for storing large datasets. The Wolfram Language has the MongoLink package for interacting with MongoDB databases.
This example shows how to train a net on the toy Fisher Iris dataset stored in a MongoDB database. Only small batches of data will be randomly sampled from the database during each training iteration. Thus, this method scales to datasets that cannot be stored in memory.
We will assume you are familiar with MongoLink and MongoDB databases. If not, reading the MongoLink Introduction tutorial is recommended before continuing.
It is also assumed that a MongoDB server is running on your local machine at the default host and port. For platform-dependent instructions on running a MongoDB server locally, see the MongoDB documentation.
Data Insertion
The dataset we are using is the Fisher Iris dataset, where the task is to classify a flower into one of three species based on four numerical features.
Create a client connection using the default host "localhost" and port 27017 (this is the default hostname and port when running the MongoDB server on your local machine):
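A sketch, assuming the MongoLink package is installed (MongoConnect[] with no arguments should use the default host and port):

    Needs["MongoLink`"]
    client = MongoConnect[]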
Let us create a MongoDB collection named "WolframNNTutorialCollection" in the database "WolframNNTutorialDatabase".
Create a collection using MongoGetCollection:
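A sketch using the database and collection names from the text:

    db = MongoGetDatabase[client, "WolframNNTutorialDatabase"];
    coll = MongoGetCollection[db, "WolframNNTutorialCollection"]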
If the collection and database do not yet exist, they will be created when we first insert data into them.
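A sketch of the insertion step, assuming the Fisher Iris data is taken from ExampleData and stored with the field names "Input" and "Output" (these names are a choice made here so that documents read back from the database match the net's port names; the original may use different fields):

    irisData = ExampleData[{"MachineLearning", "FisherIris"}, "Data"];  (* rules: {four features} -> species *)
    documents = Map[<|"Input" -> First[#], "Output" -> ToString @ Last[#]|> &, irisData];
    MongoCollectionInsert[coll, documents]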
Build a Classification Net
Now let us build a net that can perform classification on the dataset. We first require the list of all possible classes, which is best generated by a database query.
Build a list of the unique labels to which each example is assigned using MongoCollectionDistinct:
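For example, using the field name from the insertion sketch above:

    classes = MongoCollectionDistinct[coll, "Output"]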
Create a NetChain to perform the classification, using a "Class" decoder to interpret the output of the net as probabilities for each class:
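A sketch of such a chain (the hidden layer size is illustrative; the input is the vector of four numerical features):

    net = NetChain[{LinearLayer[64], Ramp, LinearLayer[Length[classes]], SoftmaxLayer[]},
      "Input" -> 4,
      "Output" -> NetDecoder[{"Class", classes}]]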
Construct a Generator Function
We need to define a generator function to use in NetTrain. This function needs to randomly sample documents from the "WolframNNTutorialCollection".
Use MongoCollectionAggregate with the "$sample" aggregation operator to obtain two random samples from the "WolframNNTutorialCollection" collection:
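A sketch, assuming MongoCollectionAggregate returns a cursor that MongoCursorToArray converts into a list of documents:

    MongoCursorToArray @ MongoCollectionAggregate[coll,
      {<|"$sample" -> <|"size" -> 2|>|>}]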
This is equivalent to using RandomSample on the collection:
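For comparison, a sketch of the equivalent in-memory sampling (reading the whole collection back with MongoCollectionFind, which defeats the purpose for large datasets):

    RandomSample[MongoCursorToArray @ MongoCollectionFind[coll], 2]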
However, we do not need the "_id" field. We can remove it either by modifying the Wolfram Language output or, more efficiently, by adding the "$project" operator to the aggregation pipeline.
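A sketch, projecting onto the "Input" and "Output" fields introduced earlier so that "_id" is dropped on the database side:

    MongoCursorToArray @ MongoCollectionAggregate[coll,
      {<|"$sample" -> <|"size" -> 2|>|>,
       <|"$project" -> <|"_id" -> 0, "Input" -> 1, "Output" -> 1|>|>}]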
Define a generator function for use in NetTrain:
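A sketch of the generator, built from the aggregation pipeline above (the name irisGenerator is a choice; because the documents use the keys "Input" and "Output", the result can be passed straight to NetTrain):

    irisGenerator[config_Association] :=
      MongoCursorToArray @ MongoCollectionAggregate[coll,
        {<|"$sample" -> <|"size" -> config["BatchSize"]|>|>,
         <|"$project" -> <|"_id" -> 0, "Input" -> 1, "Output" -> 1|>|>}]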
This generator function correctly returns a randomly sampled list of examples when the batch size is specified:
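For example:

    irisGenerator[<|"BatchSize" -> 3|>]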
There are two main valid forms of training data that generator functions can produce: either a list of example associations {<|key1->val11,key2->val21,…|>,<|key1->val12,…|>,…}, or a single association in which each key maps to a list of example values <|key1->{val11,val12,…},key2->{val21,val22,…},…|>. One form of the generator function output out can be converted to the other via Transpose[out,AllowedHeads->All].
MongoDB can directly produce the second form of grouping examples together using the $group aggregation stage. This is often much more efficient than producing the first form.
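A sketch of such a pipeline, assuming MongoLink serializes Null as BSON null; the "$push" accumulators collect the sampled values into lists, and the leftover "_id" field is dropped on the Wolfram Language side:

    grouped = MongoCursorToArray @ MongoCollectionAggregate[coll,
       {<|"$sample" -> <|"size" -> 4|>|>,
        <|"$group" -> <|"_id" -> Null,
           "Input" -> <|"$push" -> "$Input"|>,
           "Output" -> <|"$push" -> "$Output"|>|>|>}];
    KeyDrop[First @ grouped, "_id"]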
Train the Net
There is an issue with this approach: the performance on the test set is computed after every batch is trained on, rather than, as is usual, after a single pass through the entire training dataset (a round). When using a generator, NetTrain has no way of knowing what the size of a round should be unless we specify it explicitly.
Obtain the total number of examples in the collection using MongoCollectionCount:
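A sketch of the final training call, using the {generator, "RoundLength" -> n} form of NetTrain (batch size and number of rounds are illustrative):

    n = MongoCollectionCount[coll];
    trained = NetTrain[net, {irisGenerator, "RoundLength" -> n},
      BatchSize -> 16, MaxTrainingRounds -> 100]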