Computer Vision
MNIST Digit Classification
Train a digit recognizer on the MNIST database of handwritten digits using a convolutional neural network.
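First, obtain the training and test data. A minimal version of this step, assuming the "MNIST" resource from the Wolfram Data Repository:

trainingData = ResourceData["MNIST", "TrainingData"];
testData = ResourceData["MNIST", "TestData"];
RandomSample[trainingData, 5]

Define a LeNet-style convolutional network with an "Image" encoder and a "Class" decoder for the digits 0 through 9 (the layer sizes here are one reasonable choice, not the only one):

lenet = NetChain[{
    ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
    ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
    FlattenLayer[], 500, Ramp, 10, SoftmaxLayer[]},
   "Output" -> NetDecoder[{"Class", Range[0, 9]}],
   "Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}]]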
Train the network for three training rounds. NetTrain will automatically attach a CrossEntropyLossLayer using the same classes that were provided to the decoder:
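A sketch of the training call, using the test set for validation:

lenet = NetTrain[lenet, trainingData, ValidationSet -> testData, MaxTrainingRounds -> 3]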
Use NetMeasurements to test the classification performance of the trained net on the test set:
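For example, the overall accuracy and a confusion matrix plot:

NetMeasurements[lenet, testData, "Accuracy"]
NetMeasurements[lenet, testData, "ConfusionMatrixPlot"]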
CIFAR-10 Object Classification
Using the CIFAR-10 database of labeled images, train a convolutional net to predict the class of each object. First, obtain the training data:
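A sketch of this step, assuming the "CIFAR-10" resource from the Wolfram Data Repository:

obj = ResourceObject["CIFAR-10"];
trainingData = ResourceData[obj, "TrainingData"];
testData = ResourceData[obj, "TestData"];
classes = Union[Values[trainingData]];
RandomSample[trainingData, 5]

Define and train a small convolutional network; the architecture and the number of training rounds are illustrative choices:

net = NetChain[{
    ConvolutionLayer[32, 3, "PaddingSize" -> 1], Ramp, PoolingLayer[2, 2],
    ConvolutionLayer[64, 3, "PaddingSize" -> 1], Ramp, PoolingLayer[2, 2],
    FlattenLayer[], 256, Ramp, Length[classes], SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Image", {32, 32}}],
   "Output" -> NetDecoder[{"Class", classes}]];
net = NetTrain[net, trainingData, ValidationSet -> testData, MaxTrainingRounds -> 10]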
From a random sample, select the images for which the net produces the highest- and lowest-entropy predictions. High-entropy inputs can be interpreted as those for which the net is most uncertain about the correct class:
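A sketch using the "Entropy" property of the net's "Class" decoder:

imgs = Keys[RandomSample[testData, 500]];
entropies = AssociationThread[imgs -> net[imgs, "Entropy"]];
Keys[TakeLargest[entropies, 5]]    (* most uncertain *)
Keys[TakeSmallest[entropies, 5]]   (* most confident *)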
Learning an Embedding of MNIST Digits
Train a net that maps images of digits into a low-dimensional space in which images of the same digit are close together. First, obtain the training data:
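Assuming the same "MNIST" resource as above:

trainingData = ResourceData["MNIST", "TrainingData"];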
Create a training set by sampling pairs of images and associating them with True if their labels are different and False if their labels are the same:
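One way to construct such pairs; the association keys match the port names of the training net defined next:

samplePair[] := Module[{a, b},
   {a, b} = RandomSample[trainingData, 2];
   <|"Input1" -> First[a], "Input2" -> First[b],
     "Target" -> (Last[a] =!= Last[b])|>];
pairs = Merge[Table[samplePair[], {10000}], Identity];  (* ports -> example lists *)

The embedding net maps an image to a vector; wrapping it in NetPairEmbeddingOperator gives a net that outputs the distance between the embeddings of two inputs, and ContrastiveLossLayer turns that distance into a trainable loss. A sketch, using a two-dimensional embedding so that it can be plotted directly later:

embeddingNet = NetChain[{
    ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
    ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2],
    FlattenLayer[], 500, Ramp, 2},
   "Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}]];

trainingNet = NetGraph[
   <|"pair" -> NetPairEmbeddingOperator[embeddingNet],
     "loss" -> ContrastiveLossLayer[]|>,
   {"pair" -> NetPort["loss", "Input"],
    NetPort["Target"] -> NetPort["loss", "Target"]},
   "Target" -> NetEncoder["Boolean"]];

trainedNet = NetTrain[trainingNet, pairs, MaxTrainingRounds -> 3]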
Apply the network to a list of pairs of digits to compute their distances under the embedding. Digits with the same label have small distances:
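A sketch: the pair operator can be extracted from the trained graph, and applying it to lists of images at the two input ports gives a list of distances:

distanceNet = NetExtract[trainedNet, "pair"];
examples = Keys[RandomSample[trainingData, 4]];
distanceNet[<|"Input1" -> examples[[{1, 2}]], "Input2" -> examples[[{3, 4}]]|>]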
Compute the embeddings for a sample of digits and plot them. Digits with the same label are clustered under the learned embedding:
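A sketch, assuming the shared embedding net is stored under the key "Net" inside the pair operator:

embedder = NetExtract[trainedNet, {"pair", "Net"}];
sample = RandomSample[trainingData, 500];
points = Normal[embedder[Keys[sample]]];
groups = GroupBy[Thread[Values[sample] -> points], First -> Last];
ListPlot[Values[groups], PlotLegends -> Keys[groups]]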
Style Transfer
Create a new image with the content of one image and the style of another. This implementation follows the method described in Gatys et al., "A Neural Algorithm of Artistic Style". First, obtain a content image and a style image:
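Any pair of images will do; here, two placeholder images, resized to the VGG-16 input size so that the feature dimensions stay consistent below:

contentImg = ImageResize[ExampleData[{"TestImage", "House"}], {224, 224}];
styleImg = ImageResize[ExampleData[{"TestImage", "Peppers"}], {224, 224}];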
To create the image that is a mix of both of these images, start by obtaining a pre-trained image classification network:
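A sketch, keeping the network up to the content-feature layer used by Gatys et al.; the name "relu4_2" follows the internal layer naming of the Wolfram VGG-16 model:

vgg = NetModel["VGG-16 Trained on ImageNet Competition Data"];
featureNet = NetTake[vgg, "relu4_2"]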
Three loss functions are used. The first loss ensures that the content of the synthesized image is similar to that of the content image:
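With precomputed target features, the content loss is a mean-squared error on the feature arrays:

contentLoss = MeanSquaredLossLayer[];
(* e.g. the feature-space distance between the two images *)
contentLoss[<|"Input" -> featureNet[contentImg], "Target" -> featureNet[styleImg]|>]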
The second loss ensures that the style of the synthesized image is similar to that of the style image. Style similarity is defined as the mean-squared difference between the Gram matrices of the input and target:
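A sketch of both pieces. A feature array of dimensions {c, h, w} is reshaped to a c×(h·w) matrix F, and its Gram matrix is F.Fᵀ; the dimensions must be supplied explicitly, since they are fixed once the image size is fixed:

gramMatrix[{c_, h_, w_}] := NetGraph[
   <|"flatten" -> ReshapeLayer[{c, h*w}],
     "transpose" -> TransposeLayer[],
     "dot" -> DotLayer[]|>,
   {"flatten" -> "transpose", {"flatten", "transpose"} -> "dot"},
   "Input" -> {c, h, w}];

styleLoss[dims_] := NetGraph[
   <|"gram1" -> gramMatrix[dims], "gram2" -> gramMatrix[dims],
     "mse" -> MeanSquaredLossLayer[]|>,
   {NetPort["Input"] -> "gram1", NetPort["Target"] -> "gram2",
    "gram1" -> NetPort["mse", "Input"], "gram2" -> NetPort["mse", "Target"]}]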
The third loss ensures that the magnitude of intensity changes across adjacent pixels in the synthesized image is small. This helps the synthesized image look more natural:
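A sketch of a total variation loss: the mean-squared difference between the image and itself shifted by one pixel, horizontally and vertically:

tvLoss[dims_] := NetGraph[
   <|"left" -> PartLayer[{All, All, ;; -2}], "right" -> PartLayer[{All, All, 2 ;;}],
     "up" -> PartLayer[{All, ;; -2, All}], "down" -> PartLayer[{All, 2 ;;, All}],
     "lossH" -> MeanSquaredLossLayer[], "lossV" -> MeanSquaredLossLayer[],
     "sum" -> TotalLayer[]|>,
   {NetPort["Input"] -> "left", NetPort["Input"] -> "right",
    NetPort["Input"] -> "up", NetPort["Input"] -> "down",
    "left" -> NetPort["lossH", "Input"], "right" -> NetPort["lossH", "Target"],
    "up" -> NetPort["lossV", "Input"], "down" -> NetPort["lossV", "Target"],
    {"lossH", "lossV"} -> "sum"},
   "Input" -> dims]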
Define a function that creates the final training net for any content and style image. This function also creates a random initial image:
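A condensed sketch of such a function. To keep it short, a single feature layer serves both the content and the style loss (the published method uses several style layers), the target features enter through the ports "ContentFeatures" and "StyleFeatures", and all layer and port names here are our own:

createTransferNet[features_, imgDims_, featDims_] := NetGraph[
   <|(* the synthesized image, randomly initialized; the only trainable part *)
     "image" -> NetArrayLayer["Array" -> RandomReal[{-0.1, 0.1}, Prepend[imgDims, 3]]],
     "features" -> NetReplacePart[features, "Input" -> Prepend[imgDims, 3]],
     "content" -> MeanSquaredLossLayer[],
     "style" -> styleLoss[featDims],
     "tv" -> tvLoss[Prepend[imgDims, 3]]|>,
   {"image" -> "features", "image" -> "tv",
    "features" -> NetPort["content", "Input"],
    NetPort["ContentFeatures"] -> NetPort["content", "Target"],
    "features" -> NetPort["style", "Input"],
    NetPort["StyleFeatures"] -> NetPort["style", "Target"],
    "content" -> NetPort["ContentLoss"],
    "style" -> NetPort["StyleLoss"],
    "tv" -> NetPort["TVLoss"]}]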
Define a NetDecoder for visualizing the predicted image:
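A sketch; the full method would also add back the mean-image offset that the VGG-16 encoder subtracts, which is omitted here:

imageDecoder = NetDecoder["Image"];
(* e.g. interpret a random {channels, height, width} array as an RGB image *)
imageDecoder[RandomReal[{0, 1}, {3, 64, 64}]]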
The training data consists of features extracted from the content and style images. Define a feature extraction function:
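The targets are fixed during training, so they can be computed once; wrapping each array in a list adds the batch dimension, and the keys match the training net's port names:

extractFeatures[content_, style_] := <|
   "ContentFeatures" -> {featureNet[content]},
   "StyleFeatures" -> {featureNet[style]}|>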
Create the training net whose input dimensions correspond to the content and style image dimensions:
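Using the helper defined above (ImageDimensions gives {w, h}, while net arrays are {channels, h, w}, hence the Reverse):

trainingNet = createTransferNet[featureNet,
   Reverse[ImageDimensions[contentImg]],
   Dimensions[featureNet[contentImg]]]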
When training, the three losses are weighted differently to set the relative importance of the content and style. These values might need to be changed with different content and style images. Create a loss specification that defines the final loss as a combination of the three losses:
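For example, using Scaled weights on the three loss ports (the particular values are arbitrary and typically need tuning):

lossSpec = {"ContentLoss" -> Scaled[1], "StyleLoss" -> Scaled[500], "TVLoss" -> Scaled[10]};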
Optimize the image using NetTrain. LearningRateMultipliers are used to freeze all parameters in the net except for the NetArrayLayer. The training is best done on a GPU, as it will take up to an hour to get good results with CPU training. The training can be stopped at any time via Evaluation ▶ Abort Evaluation:
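A sketch of the training call; the number of rounds is an arbitrary choice:

result = NetTrain[trainingNet, extractFeatures[contentImg, styleImg],
   LossFunction -> lossSpec,
   LearningRateMultipliers -> {"image" -> 1, _ -> 0},
   BatchSize -> 1, MaxTrainingRounds -> 300,
   TargetDevice -> "GPU"]   (* use "CPU" if no GPU is available *)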
Extract the final image from the NetArrayLayer of the trained net:
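Using the decoder defined above:

finalImage = imageDecoder[NetExtract[result, {"image", "Array"}]]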
Semantic Segmentation on a Toy Text Dataset
Train a net that classifies every pixel in an image of a word as being part of the background or part of one of the letters a through z.
First, generate training and test data, which consists of images of words and the corresponding "mask" integer matrices that label each pixel:
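One toy way to generate such data: render each word twice, once normally for the input image and once per letter (with the other letters in white) to recover a pixel-aligned mask. All helper names and sizes here are our own choices; class 1 is the background, and class k+1 is the k-th letter of the alphabet:

letters = CharacterRange["a", "z"];
maskSize = {128, 32};   (* {width, height} *)

(* render the word with only the letter at position visible in black;
   white letters keep the layout identical across renders *)
render[chars_, visible_] := ImageCrop[
   Rasterize[
    Row[MapIndexed[
      Style[#1, If[visible === All || First[#2] === visible, Black, White],
        FontSize -> 24] &, chars]], "Image"],
   maskSize, Padding -> White];

makeExample[word_] := Module[{chars = Characters[word], mask, bin},
   mask = ConstantArray[1, Reverse[maskSize]];
   Do[
    bin = Round[ImageData[Binarize[ColorNegate[render[chars, i]]]]];
    mask = mask (1 - bin) + (LetterNumber[chars[[i]]] + 1) bin,
    {i, Length[chars]}];
   render[chars, All] -> mask];

trainingData = Table[
   makeExample[StringJoin[RandomChoice[letters, RandomInteger[{3, 6}]]]], {256}];
testData = Table[
   makeExample[StringJoin[RandomChoice[letters, RandomInteger[{3, 6}]]]], {64}];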
Define a convolutional net that takes an image and returns a probability vector for every pixel in the image:
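A sketch: padded convolutions preserve the spatial dimensions, a final 27-channel convolution scores the background and the 26 letters, and a transposition lets SoftmaxLayer normalize over the class dimension of every pixel. CrossEntropyLossLayer["Index"] then matches the integer masks directly:

segmentationNet = NetChain[{
    ConvolutionLayer[32, 3, "PaddingSize" -> 1], Ramp,
    ConvolutionLayer[64, 3, "PaddingSize" -> 1], Ramp,
    ConvolutionLayer[27, 3, "PaddingSize" -> 1],
    TransposeLayer[{3, 1, 2}],   (* {c, h, w} -> {h, w, c} *)
    SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Image", maskSize, "Grayscale"}]];

trainedSeg = NetTrain[segmentationNet, trainingData,
   LossFunction -> CrossEntropyLossLayer["Index"],
   ValidationSet -> testData, MaxTrainingRounds -> 20]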
Define a function that rasterizes a string and then applies the trained net to give a matrix of indices corresponding to the most likely class of each pixel:
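A sketch, reusing the render helper from the data-generation step; Colorize turns the resulting index matrix into a label image:

netPixelClasses[str_] := Module[{probs},
   probs = Normal[trainedSeg[render[Characters[str], All]]];
   Map[First[Ordering[#, -1]] &, probs, {2}]];

Colorize[netPixelClasses["subject"]]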