Hello Community,
I am attempting to implement the loss function of the YOLOv2 object-detection network in Wolfram Language. Doing this would let us train YOLOv2 on our own datasets and build domain-specific real-time object detectors.
However, I am running into a strange error that seems to have only two other mentions on the entire internet.
NetTrain::arrdiv: Training was stopped early because one or more trainable parameters of the net diverged. As no ValidationSet was provided, the most recent net will be returned, which is likely to be unusable. To avoid divergence, ensure that the training data has been normalized to have zero mean and unit variance. You can also try specifying a lower learning rate, or use a different optimization method; the (possibly automatic) values used were Method->{ADAM,Beta1->0.9,Beta2->0.999,Epsilon->1/100000,GradientClipping->None,L2Regularization->None,LearningRate->Automatic,LearningRateSchedule->None,WeightClipping->None}, LearningRate->0.001. Alternatively, you can use the "GradientClipping" option to Method to bound the magnitude of gradients during training.
The message is somewhat vague, and I am not exactly sure what is causing it. Setting LearningRate -> 0 still produces the error, and so does Method -> {"SGD", "GradientClipping" -> 0}.
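For reference, the calls I tried look roughly like this (a sketch only; net and trainingData are placeholders for the actual loss-attached net and dataset in the attached notebook):

    (* Both of these settings still trigger NetTrain::arrdiv *)
    NetTrain[net, trainingData, LearningRate -> 0]
    NetTrain[net, trainingData, Method -> {"SGD", "GradientClipping" -> 0}]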
The arrdiv message is also followed by a trail of additional errors:
CompiledFunction::cfta: Argument {238.826,{0.}} at position 1 should be a rank 1 tensor of machine-size real numbers.
CompiledFunction::cfta: Argument {45.0069,{0.}} at position 1 should be a rank 1 tensor of machine-size real numbers.
CompiledFunction::cfta: Argument {161.463,{0.}} at position 1 should be a rank 1 tensor of machine-size real numbers.
General::stop: Further output of CompiledFunction::cfta will be suppressed during this calculation.
Through some trial and error, I have found that the error appears only if I directly include part of my net (the box-IoU calculations) in the loss function.
I have not included the code directly here because it is not the focus of the discussion. Rather, I would like to know what NetTrain::arrdiv means so I can work around it, and I would like this answered publicly so that others who hit this issue in the future know what is wrong and how to fix it.
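For context, though, the box-IoU computation is along these lines (a plain-function sketch assuming {xmin, ymin, xmax, ymax} corner-format boxes, not the actual layer-based version inside the net):

    (* Intersection-over-union of two axis-aligned boxes; illustrative only *)
    boxIoU[{ax1_, ay1_, ax2_, ay2_}, {bx1_, by1_, bx2_, by2_}] :=
     Module[{iw, ih, inter, union},
      iw = Max[0, Min[ax2, bx2] - Max[ax1, bx1]]; (* intersection width *)
      ih = Max[0, Min[ay2, by2] - Max[ay1, by1]]; (* intersection height *)
      inter = iw*ih;
      union = (ax2 - ax1)*(ay2 - ay1) + (bx2 - bx1)*(by2 - by1) - inter;
      inter/Max[union, $MachineEpsilon] (* guard against a zero-area union *)]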
The current error message is not very helpful: it does not explain what the "diverging parameters" are, and its troubleshooting recommendations are ineffective here (the input is already normalized, and lowering the learning rate all the way to 0 does not prevent the error from occurring).
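By "normalized" I mean zero mean and unit variance, e.g. something along the lines of this sketch (standardize and trainingInputs are placeholder names, not the notebook's actual code):

    (* Zero-mean, unit-variance standardization of each input tensor *)
    standardize[t_] := (t - Mean[Flatten[t]])/StandardDeviation[Flatten[t]];
    normalizedInputs = standardize /@ trainingInputs;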
Here is the full notebook if anyone wants to reproduce the errors themselves.
Thanks for the help!