Computer Vision

Image Classification

MNIST Digit Classification

Train a digit recognizer on the MNIST database of handwritten digits using a convolutional neural network.

First, obtain the training and validation data:
In[1]
Out[1]
In[2]
In[3]
Out[3]
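A minimal sketch of this step, assuming the "MNIST" resource from the Wolfram Data Repository (the resource and element names are assumptions):

    (* load MNIST as lists of image -> label rules *)
    trainingData = ResourceData["MNIST", "TrainingData"];
    testData = ResourceData["MNIST", "TestData"];
    RandomSample[trainingData, 5]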
Define a convolutional neural network that takes in 28×28 grayscale images as input:
In[4]
Out[4]
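A LeNet-style chain consistent with this description might look roughly like the following (layer sizes are illustrative):

    lenet = NetChain[{
        ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
        ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
        FlattenLayer[], 500, Ramp, 10, SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}],
      "Output" -> NetDecoder[{"Class", Range[0, 9]}]]

The "Image" encoder converts any input image to a 28×28 grayscale array, and the "Class" decoder turns the softmax output into one of the digit labels 0 through 9.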
Train the network for three training rounds. NetTrain will automatically attach a CrossEntropyLossLayer using the same classes that were provided to the decoder:
In[5]
Out[5]
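A training sketch using the hypothetical names from above:

    (* the CrossEntropyLossLayer is attached automatically because the chain
       ends in SoftmaxLayer with a "Class" decoder *)
    trained = NetTrain[lenet, trainingData,
      ValidationSet -> testData, MaxTrainingRounds -> 3]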
Evaluate the trained network directly on images randomly sampled from the validation set:
In[6]
Out[6]
Obtain probability estimates of all classes for a specific input:
In[7]
Out[7]
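These two evaluation steps might look roughly like this, again using the hypothetical names from above:

    imgs = Keys[RandomSample[testData, 5]];   (* random validation images *)
    trained[imgs]                             (* predicted labels *)
    trained[First[imgs], "Probabilities"]     (* distribution over all classes *)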
Create a ClassifierMeasurements object from the trained network and the validation set:
In[8]
Out[8]
Obtain the accuracy of the network on the validation set:
In[9]
Out[9]
Obtain a plot of the confusion matrix of the network on the validation set:
In[10]
Out[10]
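A sketch covering these three measurement steps:

    cm = ClassifierMeasurements[trained, testData];
    cm["Accuracy"]
    cm["ConfusionMatrixPlot"]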

CIFAR-10 Object Classification

Using the CIFAR-10 database of labeled images, train a convolutional net to predict the class of each object. First, obtain the training data:
In[11]
Out[11]
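A sketch of this step, assuming a "CIFAR-10" resource in the Wolfram Data Repository (the resource and element names are assumptions):

    cifarData = ResourceData["CIFAR-10", "TrainingData"];
    RandomSample[cifarData, 5]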
Extract the unique classes:
In[12]
Out[12]
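Assuming the data is a list of image -> label rules, this amounts to:

    classes = Union[Values[cifarData]]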
Create a convolutional net that predicts the class, given an image:
In[13]
Out[13]
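A small convolutional classifier consistent with this description might look roughly like the following (layer sizes are illustrative):

    cifarNet = NetChain[{
        ConvolutionLayer[32, 3, PaddingSize -> 1], Ramp, PoolingLayer[2, 2],
        ConvolutionLayer[64, 3, PaddingSize -> 1], Ramp, PoolingLayer[2, 2],
        FlattenLayer[], 128, Ramp, Length[classes], SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Image", {32, 32}, "RGB"}],
      "Output" -> NetDecoder[{"Class", classes}]]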
Train the net on the training data:
In[14]
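A training sketch (the number of rounds is illustrative):

    cifarTrained = NetTrain[cifarNet, cifarData, MaxTrainingRounds -> 10]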
Predict the most likely classes for a set of images:
In[15]
Out[15]
For a specific example, give the probabilities of the most likely label assignments:
In[16]
Out[16]
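These two prediction steps might look roughly like this:

    imgs = Keys[RandomSample[cifarData, 3]];
    cifarTrained[imgs]                              (* most likely class for each image *)
    cifarTrained[First[imgs], "TopProbabilities"]   (* most likely labels with probabilities *)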
From a random sample, select the images for which the net produces the highest- and lowest-entropy predictions. High-entropy inputs can be interpreted as those for which the net is most uncertain about the correct class:
In[17]
In[18]
In[19]
Out[19]
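One way to rank images by prediction entropy is sketched below; the helper name entropy is hypothetical:

    (* entropy of the predicted class distribution; larger means more uncertain *)
    entropy[img_] :=
      Total[-# Log[#] & /@ Select[cifarTrained[img, "Probabilities"], # > 0 &]]
    sample = Keys[RandomSample[cifarData, 500]];
    sorted = SortBy[sample, entropy];
    Take[sorted, -5]   (* highest-entropy (most uncertain) images *)
    Take[sorted, 5]    (* lowest-entropy (most confident) images *)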

Learned Embedding

Learn an embedding of the digits in the MNIST dataset.

First, import the data and take only those examples with labels between 0 and 4:
In[1]
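A sketch of this step, again assuming the "MNIST" resource from the Wolfram Data Repository:

    trainData = Select[ResourceData["MNIST", "TrainingData"], Last[#] <= 4 &];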
Create a training set by sampling pairs of images and associating them with True if their labels are different and False if their labels are the same:
In[2]
In[3]
Out[3]
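One way such pairs might be generated; the helper makePair, the port names and the pair count are illustrative and match the graph sketched further below:

    makePair[] := Module[{a = RandomChoice[trainData], b = RandomChoice[trainData]},
      <|"Input1" -> First[a], "Input2" -> First[b], "Target" -> Last[a] =!= Last[b]|>]
    pairs = Table[makePair[], {10000}];
    RandomSample[pairs, 2]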
Define a convolutional net to use as an embedding network:
In[4]
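A small convolutional chain mapping a 28×28 digit to a low-dimensional vector might look roughly like this; a 2-dimensional embedding keeps the later plots simple:

    embeddingNet = NetChain[{
        ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
        ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
        FlattenLayer[], 500, Ramp, 2},
      "Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}]]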
Construct the embedding net:
In[5]
Out[5]
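One possible construction is a Siamese graph: two weight-sharing copies of the embedding net, the Euclidean distance between their outputs, and a ContrastiveLossLayer. The wiring below is only a sketch; the layer and port names are illustrative:

    sharedEmbed = NetInsertSharedArrays[embeddingNet];   (* make the two branches share weights *)
    trainingNet = NetGraph[
      <|"embed1" -> sharedEmbed, "embed2" -> sharedEmbed,
        "diff" -> ThreadingLayer[Subtract],
        "sqr" -> ElementwiseLayer[#^2 &],
        "sum" -> SummationLayer[],
        "dist" -> ElementwiseLayer[Sqrt],
        "loss" -> ContrastiveLossLayer[]|>,
      {NetPort["Input1"] -> "embed1", NetPort["Input2"] -> "embed2",
       {"embed1", "embed2"} -> "diff" -> "sqr" -> "sum" -> "dist",
       "dist" -> NetPort["loss", "Input"],
       NetPort["Target"] -> NetPort["loss", "Target"]},
      "Target" -> NetEncoder["Boolean"]]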
Train the network:
In[6]
Out[6]
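A training sketch using the hypothetical names above (the number of rounds is illustrative):

    trainedNet = NetTrain[trainingNet, pairs, MaxTrainingRounds -> 5]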
Apply the network to a list of pairs of digits to compute their distances under the embedding. Digits with the same label have small distances:
In[7]
Out[7]
Extract the embedding network:
In[8]
Out[8]
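Assuming the graph sketched above, the embedding branch can be pulled out by name:

    extractedNet = NetExtract[trainedNet, "embed1"]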
Compute the embedding of a digit:
In[9]
Out[9]
Sample 500 digits and group them by their labels:
In[10]
Compute their embeddings and plot them. Digits with the same label are clustered under the learned embedding:
In[11]
In[12]
Out[12]
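A sketch covering these last two steps, using the hypothetical names above:

    groups = GroupBy[RandomSample[trainData, 500], Last -> First];   (* label -> images *)
    embeddings = Map[extractedNet, groups];                          (* label -> 2D vectors *)
    ListPlot[Values[embeddings], PlotLegends -> Keys[groups]]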

Style Transfer

Create a new image that combines the content of one given image with the style of another. This implementation follows the method described in Gatys et al., "A Neural Algorithm of Artistic Style".

An example content image and style image:
In[1]
To create the image that is a mix of both of these images, start by obtaining a pre-trained image classification network:
In[2]
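A typical choice is a VGG-style network from the Wolfram Neural Net Repository; the particular model name below is an assumption:

    vgg = NetModel["VGG-16 Trained on ImageNet Competition Data"]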
Take a subnet that will be used as a feature extractor for the style and content images:
In[3]
Out[3]
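A sketch of taking an initial subnetwork as the feature extractor; the layer name "relu4_1" is an assumption about the model's internal naming:

    featureNet = NetTake[vgg, "relu4_1"]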
Three loss functions are used. The first ensures that the "content" of the synthesized image is similar to that of the content image:
In[4]
Out[4]
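One way to express such a content loss is a plain mean-squared loss between the feature arrays of the synthesized image and the content image, for example:

    contentLoss = MeanSquaredLossLayer[]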
The second loss ensures that the "style" is similar in the synthesized image and the style image. Style similarity is defined as the mean-squared difference between the Gram matrices of the input and target:
In[5]
In[6]
Out[6]
The third loss ensures that the magnitude of intensity changes across adjacent pixels in the synthesized image is small. This helps the synthesized image look more natural:
In[7]
In[8]
Out[8]
Define a function that creates the final training net for any content and style image. This function also creates a random initial image:
In[9]
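The essential ingredient of this training net is that the synthesized image is stored as the learnable array of a ConstantArrayLayer, initialized randomly, for example:

    (* a random RGB image of illustrative size as the trainable parameter *)
    imageLayer = ConstantArrayLayer["Array" -> RandomReal[{0, 1}, {3, 256, 256}]]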
Define a NetDecoder for visualizing the predicted image:
In[10]
Out[10]
In[11]
Out[11]
The training data consists of features extracted from the content and style images. Define a feature extraction function:
In[12]
Create a training set consisting of a single example of a content and style feature:
In[13]
Create the training net whose input dimensions correspond to the content and style image dimensions:
In[14]
When training, the three losses are weighted differently to set the relative importance of the content and style. These values might need to be changed with different content and style images. Create a loss specification that defines the final loss as a combination of the three losses:
In[15]
Optimize the image using NetTrain. LearningRateMultipliers are used to freeze all parameters in the net except for the ConstantArrayLayer. The training is best done on a GPU, as it can take up to an hour to get good results on a CPU. The training can be stopped at any time via Evaluation ▶ Abort Evaluation:
In[16]
Out[16]
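Such a call might look roughly like the following, where trainingNet, trainingData and lossSpec stand for the results of the preceding steps and "Image" stands for whatever name the ConstantArrayLayer has in the graph (all of these names are assumptions):

    trainedNet = NetTrain[trainingNet, trainingData, lossSpec,
      LearningRateMultipliers -> {"Image" -> 1, _ -> None},   (* train only the image array *)
      TargetDevice -> "GPU", MaxTrainingRounds -> 300]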
Extract the final image from the ConstantArrayLayer of the trained net:
In[17]
Out[17]
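Assuming the hypothetical layer name "Image" from above, the learned array can be read back out and interpreted as a channel-first image:

    Image[Normal@NetExtract[trainedNet, {"Image", "Array"}], Interleaving -> False]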

Semantic Segmentation (Pixel Classification)

Semantic Segmentation on a Toy Text Dataset

Train a net that classifies every pixel in an image of a word as being part of the background or part of one of the letters a through z.

First, generate training and test data, which consists of images of words and the corresponding "mask" integer matrices that label each pixel:
In[1]
In[2]
In[3]
In[4]
Out[4]
In[5]
Obtain the number of classes:
In[6]
Out[6]
Assign a color to each class index:
In[7]
Out[7]
Obtain a random sample of input images in the training set:
In[8]
Out[8]
Visualize the mask array of a single example, where each color represents a particular class:
In[9]
Out[9]
Define a convolutional net that takes an image and returns a probability vector for every pixel in the image:
In[10]
In[11]
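A per-pixel classifier consistent with this description might look roughly like the following. Spatial dimensions are preserved with padded convolutions, the transpose moves the class dimension last, numClasses is the value obtained above, and imageSize is a placeholder for the dimensions of the rasterized word images (layer sizes are illustrative):

    segNet = NetChain[{
        ConvolutionLayer[32, 3, PaddingSize -> 1], Ramp,
        ConvolutionLayer[32, 3, PaddingSize -> 1], Ramp,
        ConvolutionLayer[64, 3, PaddingSize -> 1], Ramp,
        ConvolutionLayer[numClasses, 1],
        TransposeLayer[{3, 1, 2}],   (* {classes, h, w} -> {h, w, classes} *)
        SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Image", imageSize, "Grayscale"}]]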
GPU training is recommended, as CPU training could take up to an hour. Train the net:
In[12]
Out[12]
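A training sketch, where trainingPairs stands for the generated image -> mask examples; the "Index" form of the cross-entropy loss matches an integer mask target:

    trainedSeg = NetTrain[segNet, trainingPairs,
      LossFunction -> CrossEntropyLossLayer["Index"],
      TargetDevice -> "GPU", MaxTrainingRounds -> 10]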
Create a prediction function that gives the index of the most likely class at each pixel:
In[13]
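Such a function might simply take the most probable class index at every pixel (the helper name classMatrix is hypothetical):

    classMatrix[img_] := Map[Last[Ordering[#]] &, trainedSeg[img], {2}]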
Predict the pixel classes of an image from the test set:
In[14]
Out[14]
In[15]
Out[15]
Plot the actual target mask of the example:
In[16]
Out[16]
Obtain a sorted list of probabilities for a specific pixel to be a certain class:
In[17]
Out[17]
Define a function that rasterizes a string and then applies the trained net to give a matrix of indices corresponding to the most likely class of each pixel:
In[18]
Some characters are easily distinguished from other characters:
In[19]
Out[19]
Other characters are harder:
In[20]
Out[20]
Obtain the pixel-level accuracy of the trained net on the test set:
In[21]
Out[21]
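One way to compute it, assuming testPairs is a list of image -> mask rules using the same 1-based index convention as the net output:

    (* fraction of pixels whose predicted class matches the target mask *)
    N@Mean[Flatten@Table[
        1 - Unitize[classMatrix[First[ex]] - Last[ex]], {ex, testPairs}]]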
