How can I reproduce a historical neural network result?

Posted 1 year ago

In the most highly cited neural network paper of all time, "Learning internal representations by error propagation" by Rumelhart, Hinton and Williams (RHW), the new "back-prop" method is applied to a 2-layer auto-encoder that when trained should encode (and decode) 1-hot vectors of length 2^n into binary codes of length n (page 14).

Since RHW use sigmoid activation and mean-squared loss, I specified the network and data as

n = 3;
auto = NetChain[{LinearLayer[n], ElementwiseLayer[LogisticSigmoid], 
   LinearLayer[2^n], ElementwiseLayer[LogisticSigmoid]}, 
  "Input" -> 2^n];

data = Map[# -> # &, IdentityMatrix[2^n]];

I chose n=3 because RHW give results for this case.

Next, I use the default training:

auto = NetTrain[auto, data];

But the results are not as nice as in RHW (look at third line):

TableForm[Round[Map[auto, IdentityMatrix[2^n]], .001]]

 {0.774, 0.006, 0.002, 0.01, 0.095, 0.001, 0.1, 0.022},
 {0.001, 0.833, 0.011, 0., 0.023, 0.002, 0.095, 0.},
 {0., 0.021, 0.418, 0., 0.079, 0.38, 0., 0.004},
 {0.008, 0., 0., 0.923, 0., 0.001, 0.025, 0.021},
 {0.034, 0.074, 0.076, 0., 0.808, 0.019, 0.002, 0.002},
 {0., 0.004, 0.465, 0., 0.01, 0.606, 0., 0.017},
 {0.14, 0.131, 0., 0.059, 0.001, 0., 0.783, 0.},
 {0.022, 0., 0.013, 0.063, 0.001, 0.051, 0., 0.939}

To get the encoder stage I do this:

encode = NetTake[auto, 2];

The resulting binary codes are not great:

TableForm[Round[Map[encode, IdentityMatrix[2^n]], .001]]

 {0.002, 0.009, 0.287},
 {0.998, 0.004, 0.836},
 {0.996, 0.686, 0.004},
 {0.006, 0.992, 0.993},
 {0.591, 0.001, 0.001},
 {0.995, 0.988, 0.006},
 {0.301, 0.003, 0.997},
 {0.004, 0.951, 0.002}

Table 5 of RHW shows the 8 codes for n=3 and all the numbers are 0, 1 and 0.5, so nothing like I obtained with the Mma implementation. Are there options to try to make the results more in line with RHW? RHW used simple gradient descent because the stochastic variant hadn't been invented yet. But with default training in this example the batch size is given as 8 (all the data), so not stochastic in the usual sense of generating the gradient from random subsets of the data. What is a good method, in the Mma implementation, to further reduce the loss?

POSTED BY: Veit Elser
4 Replies
Posted 1 year ago

Thanks Joshua! I was hasty when reading the guide and didn't see that sigmoids in the final layer changed the default loss. I should have guessed this when the progress monitor started plotting "error" in addition to the loss.

In the meantime I had also found that increasing MaxTrainingRounds reduced the loss substantially, with the result that the autoencoder accuracy became quite good. On the other hand, the codes generated by the encoder stage of the well-trained autoencoder still deviated from binary codes, exactly as reported by RHW. I repeated those experiments with the properly implemented mean-squared loss (following RHW). The resulting codes were non-binary to about the same extent.

POSTED BY: Veit Elser

You should not expect to get the same learned weights (and resulting hidden-unit patterns) as RHW, because the solution is not unique: you can always apply some matched linear transformation to the learned weights. RHW refer to this indirectly in the paper:

"It is of some interest that the system employed its ability to use intermediate values in solving this problem. It could, of course, have found a solution in which the hidden units took on only the values of zero and one. Often it does just that, but in this instance, and many others, there are solutions that use the intermediate values, and the learning system finds them even though it has a bias toward extreme values."
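One concrete instance of such a matched transformation is relabeling the hidden units: permute the rows of the first layer's weights and biases, and permute the columns of the second layer's weights the same way. The sketch below uses random weights rather than a trained net, and the names forward, perm, w1, b1, w2, b2 are chosen only for illustration. It checks that the output of the 8-3-8 sigmoid network is unchanged while the hidden code is reordered.

```wolfram
n = 3;
SeedRandom[1];

(* random stand-ins for the learned parameters; not taken from a trained net *)
w1 = RandomReal[{-1, 1}, {n, 2^n}];  b1 = RandomReal[{-1, 1}, n];
w2 = RandomReal[{-1, 1}, {2^n, n}];  b2 = RandomReal[{-1, 1}, 2^n];

(* hidden code and full forward pass of the two-layer sigmoid network *)
hidden[w1_, b1_][x_] := LogisticSigmoid[w1 . x + b1];
forward[w1_, b1_, w2_, b2_][x_] :=
  LogisticSigmoid[w2 . hidden[w1, b1][x] + b2];

(* relabel the hidden units: rows of layer 1, matching columns of layer 2 *)
perm = {2, 3, 1};
w1p = w1[[perm]];  b1p = b1[[perm]];
w2p = w2[[All, perm]];

x = UnitVector[2^n, 3];

(* largest output difference; should be zero up to floating-point rounding *)
outputDiff = Max[Abs[forward[w1, b1, w2, b2][x] - forward[w1p, b1p, w2p, b2][x]]]

(* the hidden code itself is different: it is the permuted original *)
hidden[w1p, b1p][x] == hidden[w1, b1][x][[perm]]
```

So two nets with identical input-output behavior can show different hidden-unit patterns, which is why a different code table from the one in RHW is not by itself a sign of a failed reproduction.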

POSTED BY: Joshua Schrier
Posted 1 year ago

I did not expect to get the same learned weights or values on the hidden nodes, because these things depend on how the network is seeded (using RandomSeeding -> Automatic I get different results). What intrigued me about Table 5 of RHW was the fact that the codes seemed to be ternary, with values 0, 0.5 and 1 (0.5 is a point of symmetry for the sigmoid function). After my historical reenactment of the experiment I believe RHW may have rounded the values in Table 5 to the nearest half. I also repeated the experiment with n=4 and again there was a broad distribution of intermediate values (not just 0.5).
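To illustrate the rounding hypothesis, here is a small sketch that rounds the encoder outputs listed in the original post to the nearest half. Those values come from the net trained with the default loss, not from RHW, and the names codes and rounded are just for illustration.

```wolfram
(* encoder outputs copied from the original post (default-loss training) *)
codes = {
   {0.002, 0.009, 0.287},
   {0.998, 0.004, 0.836},
   {0.996, 0.686, 0.004},
   {0.006, 0.992, 0.993},
   {0.591, 0.001, 0.001},
   {0.995, 0.988, 0.006},
   {0.301, 0.003, 0.997},
   {0.004, 0.951, 0.002}};

(* round every entry to the nearest multiple of 0.5 *)
rounded = Round[codes, 0.5];

TableForm[rounded]
```

By construction every entry becomes 0, 0.5 or 1, so a table rounded this way would look ternary even if the underlying hidden values are broadly distributed.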

POSTED BY: Veit Elser

A few salient points:

You state that you wish to use "mean-squared loss". However, your code does not do this. When an ElementwiseLayer[LogisticSigmoid] is the last layer, NetTrain defaults to CrossEntropyLossLayer (see ref/LossFunction). If you wish to use a loss function other than this default, set the LossFunction -> MeanSquaredLossLayer[] option in NetTrain.

You may need to increase the MaxTrainingRounds option in NetTrain. By default (if nothing else is specified) it will run a maximum of 10^4 batches, but this may not be enough. Try increasing it until the results converge. For example, using the default loss (CrossEntropyLossLayer), I found that the loss was still decreasing when this default limit was reached; setting MaxTrainingRounds -> 10^5 gave well-converged results.

[Figure: example loss convergence plot]
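Putting both points together, a minimal sketch of the training call might look like the following. It reuses the network and data from the original post; the variable names net and trained are only for illustration, and 10^5 rounds is the value that worked in my test, not a guaranteed setting.

```wolfram
n = 3;

(* same 8-3-8 sigmoid autoencoder as in the original post *)
net = NetChain[{LinearLayer[n], ElementwiseLayer[LogisticSigmoid],
    LinearLayer[2^n], ElementwiseLayer[LogisticSigmoid]},
   "Input" -> 2^n];

(* each 1-hot vector is mapped to itself *)
data = Map[# -> # &, IdentityMatrix[2^n]];

(* explicit mean-squared loss and a larger round limit than the default *)
trained = NetTrain[net, data,
   LossFunction -> MeanSquaredLossLayer[],
   MaxTrainingRounds -> 10^5];

(* reconstructed outputs and hidden codes for all 2^n inputs *)
TableForm[Round[Map[trained, IdentityMatrix[2^n]], .001]]
TableForm[Round[Map[NetTake[trained, 2], IdentityMatrix[2^n]], .001]]
```

The results depend on the random initialization, so the hidden codes will differ from run to run.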

POSTED BY: Joshua Schrier