In the most highly cited neural network paper of all time, "Learning internal representations by error propagation" by Rumelhart, Hinton, and Williams (RHW), the new "back-prop" method is applied to a two-layer auto-encoder that, when trained, should encode (and decode) one-hot vectors of length 2^n into binary codes of length n (page 14).
Since RHW use sigmoid activations and mean-squared loss, I specified the network and data as:
n = 3;
auto = NetChain[
   {LinearLayer[n], ElementwiseLayer[LogisticSigmoid],
    LinearLayer[2^n], ElementwiseLayer[LogisticSigmoid]},
   "Input" -> 2^n];
data = Map[# -> # &, IdentityMatrix[2^n]];
I chose n=3 because RHW give results for this case.
Next, I train with the default settings:
auto = NetTrain[auto, data];
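(I believe NetTrain chooses a loss automatically for this net; to be certain it matches RHW's mean-squared error, the loss can also be pinned down explicitly, like so:)

```mathematica
(* same training call, but with the loss layer stated explicitly *)
auto = NetTrain[auto, data, LossFunction -> MeanSquaredLossLayer[]];
```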
But the results are not as nice as in RHW (look at the third row):
TableForm[Round[Map[auto, IdentityMatrix[2^n]], .001]]
{
{0.774, 0.006, 0.002, 0.01, 0.095, 0.001, 0.1, 0.022},
{0.001, 0.833, 0.011, 0., 0.023, 0.002, 0.095, 0.},
{0., 0.021, 0.418, 0., 0.079, 0.38, 0., 0.004},
{0.008, 0., 0., 0.923, 0., 0.001, 0.025, 0.021},
{0.034, 0.074, 0.076, 0., 0.808, 0.019, 0.002, 0.002},
{0., 0.004, 0.465, 0., 0.01, 0.606, 0., 0.017},
{0.14, 0.131, 0., 0.059, 0.001, 0., 0.783, 0.},
{0.022, 0., 0.013, 0.063, 0.001, 0.051, 0., 0.939}
}
To extract the encoder stage, I take the first two layers:
encode = NetTake[auto, 2];
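(Similarly, the decoder stage should be obtainable by taking the last two layers with a span, though I have not needed it here:)

```mathematica
(* layers 3 through 4 of the trained chain, i.e. the decoder *)
decode = NetTake[auto, {3, 4}];
```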
The resulting binary codes are not great:
TableForm[Round[Map[encode, IdentityMatrix[2^n]], .001]]
{
{0.002, 0.009, 0.287},
{0.998, 0.004, 0.836},
{0.996, 0.686, 0.004},
{0.006, 0.992, 0.993},
{0.591, 0.001, 0.001},
{0.995, 0.988, 0.006},
{0.301, 0.003, 0.997},
{0.004, 0.951, 0.002}
}
Table 5 of RHW shows the eight codes for n=3, and all the values there are 0, 1, or 0.5, so nothing like what I obtained with the Mathematica implementation. Are there options that would make the results more in line with RHW? RHW used simple gradient descent because the stochastic variant had not yet been invented. But with default training in this example the batch size is reported as 8 (the entire data set), so it is not stochastic in the usual sense of computing the gradient from random subsets of the data. What is a good method, in the Mathematica implementation, to reduce the loss further?
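For concreteness, this is the kind of option combination I have been considering to mimic RHW's plain full-batch gradient descent (the learning rate, momentum, and round count are guesses on my part, not known good settings):

```mathematica
(* full-batch SGD: BatchSize equal to the whole data set, many rounds *)
auto2 = NetTrain[auto, data,
   Method -> {"SGD", "LearningRate" -> 0.5, "Momentum" -> 0.9},
   BatchSize -> 2^n,               (* all 8 examples per gradient step *)
   MaxTrainingRounds -> 50000,
   LossFunction -> MeanSquaredLossLayer[]];
```

Is something along these lines the right approach, or is a different Method (e.g. the default "ADAM" with more rounds) preferable?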