Inverted MNIST Digit Data

Posted 4 years ago

I tried a simple NN on the MNIST Digit Data:

resource = ResourceObject["MNIST"];
trainingData = ResourceData[resource, "TrainingData"];
testData = ResourceData[resource, "TestData"];

n = NetChain[{FlattenLayer[], 64, Ramp, 10, SoftmaxLayer[]}, 
  "Output" -> NetDecoder[{"Class", Range[0, 9]}], 
  "Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}]]

Trained it a short time:

In[5]:= AbsoluteTiming[
 t1 = NetTrain[n, trainingData, BatchSize -> 100, 
   MaxTrainingRounds -> 4]]

Out[5]= {18.2776, NetChain[ <> ]}

Checked the accuracy:

In[6]:= ClassifierMeasurements[t1, testData, "Accuracy"]

Out[6]= 0.9176

Nothing unusual so far. But then I tried the same NN with the same training under Keras/TensorFlow. Results:

Wall time: 7.58 s

10000/10000 [==============================] - 0s 30us/sample - loss: 0.1159 - acc: 0.9666

The time difference seems plausible, because I trained on a CPU under MMA and on a TPU under Keras. But where does the difference in accuracy come from? 4.9 percentage points is quite a lot.

Image of the digits within MMA vs. image in Keras: [images: the same digit rendered with a white background in MMA and a black background in Keras]

The background in MMA is white (1), whereas in Keras it is black (0). The latter is also the form of the original data from http://yann.lecun.com/exdb/mnist/.

Reinversion is easy but takes some time:

itrain = MapAt[1 - # &, trainingData, {All, 1}];
itest = MapAt[1 - # &, testData, {All, 1}];
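For comparison, the same reinversion on raw pixel arrays (the form Keras consumes) is a one-liner in numpy. This is a sketch on a toy stand-in array, assuming the MMA convention of values in [0, 1] with a white (1) background; it is not the real MNIST data:

```python
import numpy as np

# Toy stand-in for MNIST images in the MMA convention:
# white background == 1.0, darker "stroke" pixels in the middle.
images = np.ones((4, 28, 28), dtype=np.float32)
images[:, 10:18, 10:18] = 0.2

# Reinversion: map the white background (1) to black (0),
# matching the original http://yann.lecun.com/exdb/mnist/ convention.
inverted = 1.0 - images

assert inverted[0, 0, 0] == 0.0  # background is now exactly zero
```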

Train the NN again with the reinverted data:

In[10]:= AbsoluteTiming[
 t2 = NetTrain[n, itrain, BatchSize -> 100, MaxTrainingRounds -> 4]]

Out[10]= {9.76811, NetChain[ <> ]}

Check accuracy:

In[11]:= ClassifierMeasurements[t2, itest, "Accuracy"]

Out[11]= 0.9636

Not only has the training time halved, the accuracy has also improved by over 4 percentage points and is now similar to the Keras result!

Does anyone have an explanation for why the inversion has such a big impact?

POSTED BY: gus s
7 Replies

What I got from the discussion is that batch normalization scales a batch to zero mean and unit variance, so the scaling of the original data becomes irrelevant. Whether an image is 0 background and 1 foreground or vice versa, or even 100 background and 200 foreground, batch normalization learns how to rescale the data so that the network always behaves the same as long as the information in the images is the same.
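That invariance is easy to verify numerically: batch-normalizing x and any affine rescaling a + b*x yields the same values up to the sign of b, which the layer's learnable scaling can absorb. A minimal numpy sketch of the normalization step (not Mathematica's internal implementation):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalize each feature over the batch to zero mean, unit variance."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.random((100, 784))                     # batch of flattened "images" in [0, 1]

normal = batch_normalize(x)
inverted = batch_normalize(1.0 - x)            # background/foreground swapped
shifted = batch_normalize(100.0 + 100.0 * x)   # arbitrary affine rescaling

# Inversion only flips the sign, which a learnable scaling can undo;
# a positive affine rescaling changes (almost) nothing at all.
assert np.allclose(inverted, -normal, atol=1e-3)
assert np.allclose(shifted, normal, atol=1e-3)
```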

POSTED BY: Martijn Froeling

Regarding the difference between the two inversion methods: they only agree up to the 5th decimal, so there might be something going on there, but I have no clue what.

In[93]:= (1 - trainingData[[1, 1]]) === 
 ColorNegate[trainingData[[1, 1]]]
ImageData[1. - trainingData[[1, 1]]] === 
 ImageData[ColorNegate[trainingData[[1, 1]]]]
(Round[ImageData[1. - trainingData[[1, 1]]], 10.^-5]) === (Round[
   ImageData[ColorNegate[trainingData[[1, 1]]]], 10.^-5])

Out[93]= False

Out[94]= False

Out[95]= True

Regarding the inversion of the grayscale images: although the Image head in Mathematica can be useful, it is very, very slow. If you want speed, treat images as what they are, arrays of numbers; that is what every other language does as well. As you can see, this is almost 100x faster than the 1 - image you had and about 5x faster than ColorNegate.

In[58]:= AbsoluteTiming[
 itrain1 = Map[ColorNegate[First[#]] -> Last[#] &, trainingData];
 itest1 = Map[ColorNegate[First[#]] -> Last[#] &, testData];
 ]

Out[58]= {6.07861, Null}

In[59]:= AbsoluteTiming[
 itrain2 = Map[1 - ImageData[First[#]] -> Last[#] &, trainingData];
 itest2 = Map[1 - ImageData[First[#]] -> Last[#] &, testData];
 ]

Out[59]= {1.08891, Null}

This also holds for the training of the network: NetEncoder is slow. I have the strong feeling that it runs on the CPU instead of the GPU. My CPU was number crunching some other stuff (~75% CPU usage) while running these examples, and the difference is remarkable.

n = NetChain[{FlattenLayer[], 64, Ramp, 10, SoftmaxLayer[]}, 
  "Output" -> NetDecoder[{"Class", Range[0, 9]}], 
  "Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}]]
n2 = NetChain[{FlattenLayer[], 64, Ramp, 10, SoftmaxLayer[]}, 
  "Output" -> NetDecoder[{"Class", Range[0, 9]}], "Input" -> {28, 28}]

Running the training is then more than 10x faster when not using Image + NetEncoder, which is very important when training large problems. So without Image the inversion is 5x faster and the training is 10x faster. Although Mathematica can make your life easier with all kinds of encoders, treat data as what it is: numbers in arrays!

In[128]:= 
t2 = NetTrain[n, itrain1, All, BatchSize -> 500, 
   MaxTrainingRounds -> 4, TargetDevice -> "GPU"];
t2["TotalTrainingTime"]
ClassifierMeasurements[t2["TrainedNet"], itest1, "Accuracy"]

Out[129]= 29.547

Out[130]= 0.9449

In[116]:= 
t3 = NetTrain[n2, itrain2, All, BatchSize -> 500, 
   MaxTrainingRounds -> 4, TargetDevice -> "GPU"];
t3["TotalTrainingTime"]
ClassifierMeasurements[t3["TrainedNet"], itest2, "Accuracy"]

Out[117]= 2.12835

Out[118]= 0.9449

As for why the inverted images behave differently from the normal images: if you train on the inverted images, most of the data is 0 and only very few values are non-zero. The network trains a function to describe the data, which is less complicated if the image is sparse (mostly 0).

In[141]:= (*Original data*)
Count[N@Flatten[ImageData[trainingData[[1, 1]]]], 0.]
(*inverted data*)
Count[Flatten[ImageData[1. - trainingData[[1, 1]]]], 0.]

Out[141]= 2

Out[142]= 608

To test this hypothesis we can make the original data more sparse and see whether that improves the training. We make the data more sparse by introducing more 0 values.

In[170]:= AbsoluteTiming[
 itrain0 = Map[ImageData[First[#]] -> Last[#] &, trainingData];
 itest0 = Map[ImageData[First[#]] -> Last[#] &, testData];
 ]

AbsoluteTiming[
 itrain0S = 
  Map[Clip[(ImageData[First[#]] - 0.5), {0, 1}] -> Last[#] &, 
   trainingData];
 itest0S = 
  Map[Clip[(ImageData[First[#]] - 0.5), {0, 1}] -> Last[#] &, 
   testData];
 ]

(*Original data*)
Count[N@Flatten[Clip[ImageData[trainingData[[1, 1]]], {0, 1}]], 0.]
(*Sparse original data*)
Count[N@Flatten[
   Clip[ImageData[trainingData[[1, 1]]] - 0.5, {0, 1}]], 0.]

Out[170]= {1.23638, Null}

Out[171]= {1.81911, Null}

Out[172]= 2

Out[173]= 125
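The same zero-counting sanity check can be sketched in numpy on a toy stand-in for one digit image (the exact counts below come from the toy image, not from real MNIST data):

```python
import numpy as np

# Toy 28x28 "digit": white (1.0) background, an 8x8 block of darker stroke pixels.
img = np.ones((28, 28), dtype=np.float64)
img[10:18, 10:18] = 0.3

original_zeros = np.count_nonzero(img == 0.0)          # white background: no zeros
inverted_zeros = np.count_nonzero((1.0 - img) == 0.0)  # black background: many zeros

# Shifting and clipping also manufactures zeros without inverting:
sparse = np.clip(img - 0.5, 0.0, 1.0)                  # strokes (0.3 - 0.5) clip to 0
sparse_zeros = np.count_nonzero(sparse == 0.0)

print(original_zeros, inverted_zeros, sparse_zeros)    # prints: 0 720 64
```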

So let's see if it helps:

In[144]:= 
t0 = NetTrain[n2, itrain0, All, BatchSize -> 500, 
   MaxTrainingRounds -> 4, TargetDevice -> "GPU"];
t0["TotalTrainingTime"]
ClassifierMeasurements[t0["TrainedNet"], itest0, "Accuracy"]

Out[145]= 3.93681

Out[146]= 0.9053

In[167]:= 
t0S = NetTrain[n2, itrain0S, All, BatchSize -> 500, 
   MaxTrainingRounds -> 4, TargetDevice -> "GPU"];
t0S["TotalTrainingTime"]
ClassifierMeasurements[t0S["TrainedNet"], itest0S, "Accuracy"]

Out[168]= 2.6613

Out[169]= 0.9109

So the less information the images contain, the less is needed to describe the data, and the easier it becomes for a NN to figure out what is going on.

I attached the notebook so others can test this.

Attachments:
POSTED BY: Martijn Froeling
Posted 4 years ago

Thanks a lot. Very useful stuff.

In addition to the big effect of the Image NetEncoder you highlighted, the initial FlattenLayer also seems to be ~10% slower than a direct call to Flatten. Putting this together, I get on my machine:

In[4]:= nn = 
 NetChain[{64, Ramp, 10, SoftmaxLayer[]}, 
  "Output" -> NetDecoder[{"Class", Range[0, 9]}], "Input" -> 28*28]

Out[4]= NetChain[ <> ]

In[5]:= AbsoluteTiming[
 itrain3 = (1. - Flatten@ImageData@First@#) -> Last@# & /@ 
   trainingData;
 itest3 = (1. - Flatten@ImageData@First@#) -> Last@# & /@ testData;]

Out[5]= {1.59009, Null}

In[6]:= AbsoluteTiming[
 tn = NetTrain[nn, itrain3, BatchSize -> 100, MaxTrainingRounds -> 4]]

Out[6]= {5.67012, NetChain[ <> ]}

In[7]:= ClassifierMeasurements[tn, itest3, "Accuracy"]

Out[7]= 0.9652

Training is now faster than under Keras/TensorFlow! Accuracy is about even. I like that, because I prefer working in MMA to Python/IPython/Jupyter/Keras.

Regarding the inversion: I was first puzzled that a bijective mapping, which leaves the entropy unchanged, could have any effect at all. But it seems plausible that the zero plays a special role by making terms vanish and simplifying the computation. So maybe Wolfram should change the curated MNIST data back to its original form with a zero background, which would make things comparable to other systems. Also, if the Image NetEncoder cannot be made faster, at least a "Possible Issues" section in the NetEncoder help would be nice.
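One concrete way the zero "makes terms vanish": in a dense layer, the weight gradient is the outer product of the input activations and the backpropagated error, so every weight attached to a zero-valued pixel receives exactly zero gradient in that step. A numpy sketch for a single linear + softmax layer (a simplification for illustration, not the actual training code of the net above):

```python
import numpy as np

rng = np.random.default_rng(1)

# One flattened "image" with a mostly-zero (black) background:
# only 64 foreground pixels are non-zero.
x = np.zeros(784)
x[300:364] = rng.random(64)

W = rng.standard_normal((784, 10)) * 0.01
target = np.zeros(10)
target[3] = 1.0

# Forward pass: linear layer + softmax, cross-entropy loss.
logits = x @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
delta = probs - target            # dLoss/dlogits for softmax + cross-entropy

# dLoss/dW = outer(x, delta): every row belonging to a zero pixel is all zero.
grad_W = np.outer(x, delta)
zero_rows = np.count_nonzero(np.all(grad_W == 0.0, axis=1))
print(zero_rows)                  # 784 - 64 = 720 weight rows get no update
```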

POSTED BY: gus s

I talked to someone with more knowledge on this topic than I have, and after a nice grin he told me that this is why batch normalization exists.

In[4]:= nn = 
  NetChain[{64, Ramp, 10, SoftmaxLayer[]}, 
   "Output" -> NetDecoder[{"Class", Range[0, 9]}], "Input" -> 28*28];
nnn = NetChain[{BatchNormalizationLayer[], 64, Ramp, 10, 
    SoftmaxLayer[]}, "Output" -> NetDecoder[{"Class", Range[0, 9]}], 
   "Input" -> 28*28];

AbsoluteTiming[
 itrain3 = (1. - Flatten@ImageData@First@#) -> Last@# & /@ 
   trainingData;
 itest3 = (1. - Flatten@ImageData@First@#) -> Last@# & /@ testData;
 ]

AbsoluteTiming[
 itrain2 = (Flatten@ImageData@First@#) -> Last@# & /@ trainingData;
 itest2 = (Flatten@ImageData@First@#) -> Last@# & /@ testData;
 ]

Out[6]= {1.10475, Null}

Out[7]= {0.812343, Null}

In[8]:= tr2 = 
  NetTrain[nn, itrain2, All, BatchSize -> 500, 
   MaxTrainingRounds -> 20, TargetDevice -> "GPU"];
tr2["TotalTrainingTime"]
ClassifierMeasurements[tr2["TrainedNet"], itest2, "Accuracy"]

Out[9]= 10.1913

Out[10]= 0.9352

In[11]:= tr2n = 
  NetTrain[nnn, itrain2, All, BatchSize -> 500, 
   MaxTrainingRounds -> 20, TargetDevice -> "GPU"];
tr2n["TotalTrainingTime"]
ClassifierMeasurements[tr2n["TrainedNet"], itest2, "Accuracy"]

Out[12]= 10.2112

Out[13]= 0.9709

In[14]:= tr3 = 
  NetTrain[nn, itrain3, All, BatchSize -> 500, 
   MaxTrainingRounds -> 20, TargetDevice -> "GPU"];
tr3["TotalTrainingTime"]
ClassifierMeasurements[tr3["TrainedNet"], itest3, "Accuracy"]

Out[15]= 10.0675

Out[16]= 0.9732

In[17]:= tr3n = 
  NetTrain[nnn, itrain3, All, BatchSize -> 500, 
   MaxTrainingRounds -> 20, TargetDevice -> "GPU"];
tr3n["TotalTrainingTime"]
ClassifierMeasurements[tr3n["TrainedNet"], itest3, "Accuracy"]

Out[18]= 10.2946

Out[19]= 0.9734
POSTED BY: Martijn Froeling
Posted 4 years ago

Interesting. BatchNormalizationLayer[] adds 3136 new parameters to the net:

In[8]:= NetInformation[NetExtract[nnn, 1], "ArraysElementCounts"]
Total@Values@%

Out[8]= <|{"Biases"} -> 784, {"MovingMean"} -> 
  784, {"MovingVariance"} -> 784, {"Scaling"} -> 784|>

Out[9]= 3136

That looks to me like a kind of dynamic, learnable scaling during training, compared to the static scaling done by inverting the data before training. It improves learning on the inverted data while neither improving nor hurting learning on the original data.
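The 3136 figure is just the four per-feature arrays of a batch-norm layer over a 784-dimensional input; only "Scaling" and "Biases" are gradient-trained, while the moving statistics are running averages collected during training. A sketch of the bookkeeping and the inference-time formula, using the generic gamma/beta naming (an assumption about the convention, not Mathematica's internals):

```python
import numpy as np

n = 28 * 28  # 784 input features

# The four per-feature arrays reported by NetInformation, 784 entries each.
gamma = np.ones(n)         # "Scaling"        (gradient-trained)
beta = np.zeros(n)         # "Biases"         (gradient-trained)
moving_mean = np.zeros(n)  # "MovingMean"     (running statistic)
moving_var = np.ones(n)    # "MovingVariance" (running statistic)

total_params = gamma.size + beta.size + moving_mean.size + moving_var.size
print(total_params)        # 4 * 784 = 3136, matching the count above

def bn_inference(x, eps=1e-5):
    """Inference-time batch norm: normalize with the stored running
    statistics, then apply the learned affine scale and shift."""
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta
```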

Mathematica Help says: "BatchNormalizationLayer is commonly inserted between a ConvolutionLayer and its activation function in order to stabilize and speed up training", which doesn't apply here.

Also, the AI course of Andrew Ng stresses its use in intermediate layers, which isn't the case here either.

So I wonder whether there are general rules for applying a BatchNormalizationLayer or doing some kind of static scaling, or whether it comes down to experience and intuition.

POSTED BY: gus s

I don't have an explanation, but color inversion might be faster with this:

Map[ColorNegate[First[#]] -> Last[#] &, trainingData]
POSTED BY: Arnoud Buzing
Posted 4 years ago

Thanks. That is 30x faster!

In[46]:= AbsoluteTiming[
 itrain = MapAt[1 - # &, trainingData, {All, 1}];
 itest = MapAt[1 - # &, testData, {All, 1}];]

Out[46]= {107.298, Null}

In[47]:= AbsoluteTiming[
 itrain1 = Map[ColorNegate[First[#]] -> Last[#] &, trainingData];
 itest1 = Map[ColorNegate[First[#]] -> Last[#] &, testData];]

Out[47]= {3.38229, Null}

Training with the data processed this way:

In[50]:= AbsoluteTiming[
 t3 = NetTrain[n, itrain1, BatchSize -> 100, MaxTrainingRounds -> 4]]

Out[50]= {19.329, NetChain[ <> ]}

In[51]:= ClassifierMeasurements[t3, itest1, "Accuracy"]

Out[51]= 0.9653

Accuracy is as high as before, but the training time has doubled again. That is another mystery to me: how can the differences between itrain and itrain1, which are only rounding differences, cause a doubling of the training time?

POSTED BY: gus s