Error "parameters of the net diverged" while training a Neural Net

Posted 3 years ago

Hello Community,

I am attempting to implement the YOLOv2 object-detection neural network loss function in Wolfram Language. Doing this will allow us to train YOLOv2 on our own datasets to build our own domain-specific real-time object detectors.

However, I am running into a strange error which seems to have only two other mentions on the entire internet.

NetTrain::arrdiv: Training was stopped early because one or more trainable parameters of the net diverged. As no ValidationSet was provided, the most recent net will be returned, which is likely to be unusable. To avoid divergence, ensure that the training data has been normalized to have zero mean and unit variance. You can also try specifying a lower learning rate, or use a different optimization method; the (possibly automatic) values used were Method->{ADAM,Beta1->0.9,Beta2->0.999,Epsilon->1/100000,GradientClipping->None,L2Regularization->None,LearningRate->Automatic,LearningRateSchedule->None,WeightClipping->None}, LearningRate->0.001. Alternatively, you can use the "GradientClipping" option to Method to bound the magnitude of gradients during training.

The message is somewhat vague, and I am not exactly sure what is causing it. If I set LearningRate->0, I still get this error. If I set Method -> {"SGD", "GradientClipping" -> 0}, I also get this error.
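One possible reading of why LearningRate->0 does not help (an assumption, since the notebook's code is not shown here): if the gradient already contains a non-finite value such as NaN or Infinity, multiplying it by a zero learning rate still gives NaN under IEEE floating-point arithmetic, so the update itself corrupts the weights. A minimal sketch in Python rather than Wolfram Language, using a hypothetical `sgd_step` helper that is not NetTrain's actual optimizer:

```python
import math


def sgd_step(weight, gradient, learning_rate):
    """One plain SGD update: w <- w - lr * g (illustrative only)."""
    return weight - learning_rate * gradient


# With a finite gradient, a zero learning rate leaves the weight untouched.
finite_update = sgd_step(1.0, 0.5, 0.0)

# With a non-finite gradient, 0.0 * nan and 0.0 * inf are both nan,
# so the weight becomes nan even though the learning rate is zero.
nan_update = sgd_step(1.0, float("nan"), 0.0)
inf_update = sgd_step(1.0, float("inf"), 0.0)

print(finite_update, math.isnan(nan_update), math.isnan(inf_update))
```

If something like this is happening, the place to look would be an operation in the loss that can produce a non-finite value, rather than the optimizer settings.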

This is also followed by a trail of additional errors:

CompiledFunction::cfta: Argument {238.826,{0.}} at position 1 should be a rank 1 tensor of machine-size real numbers.

CompiledFunction::cfta: Argument {45.0069,{0.}} at position 1 should be a rank 1 tensor of machine-size real numbers.

CompiledFunction::cfta: Argument {161.463,{0.}} at position 1 should be a rank 1 tensor of machine-size real numbers.

General::stop: Further output of CompiledFunction::cfta will be suppressed during this calculation.

Through some trial and error, I have found that the error appears whenever I include part of my net (the box IOU calculations) directly in the loss function.
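For readers unfamiliar with the calculation, here is a rough sketch of a box IOU with the two guards that are commonly added to keep it finite: clamping widths and heights at zero, and adding a small epsilon to the denominator. This is written in Python for illustration and is not the code from the notebook; the box layout `(x_min, y_min, x_max, y_max)`, the function name `box_iou`, and the `eps` value are all assumptions.

```python
def box_iou(box_a, box_b, eps=1e-7):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max).

    The layout and eps are assumptions made for this sketch.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Clamp at zero so non-overlapping boxes give an empty intersection
    # instead of a negative width or height.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    # Clamp again so malformed boxes cannot produce negative areas.
    area_a = max(0.0, ax1 - ax0) * max(0.0, ay1 - ay0)
    area_b = max(0.0, bx1 - bx0) * max(0.0, by1 - by0)
    union = area_a + area_b - intersection

    # eps keeps the ratio finite when both boxes have zero area (0 / 0).
    return intersection / (union + eps)


overlapping = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
degenerate = box_iou((1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0))
print(overlapping, degenerate)
```

Without the epsilon, two zero-area boxes would give 0/0. Whether that is what triggers the error in this net is only a guess.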

The code is not included directly here because it is not the focus of the discussion. Rather, I would like to know what NetTrain::arrdiv means so I can get around this. I would also like to get this answered publicly so others who have this issue in the future may know what is wrong and how to fix it.

The current error message is not very helpful, since it does not really tell me what "diverging parameters" are, and the troubleshooting recommendations are ineffective (the input is already normalized, and lowering the learning rate to 0.0 does not prevent the error from occurring).

Here is the full notebook if anyone wants to see the errors themselves.

Thanks for the help!

POSTED BY: Alec Graves
2 Replies
Posted 3 years ago

Sooo, I managed to implement a workaround by thresholding the IOU values before comparing them in my loss function (this might cause the model's confidence prediction distribution to be shaped incorrectly).
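To make the idea concrete, thresholding here could look something like the sketch below. It is Python rather than the Wolfram Language code from the notebook, and the function name `threshold_iou` and the cutoff of 0.5 are made up for illustration.

```python
def threshold_iou(iou, threshold=0.5):
    """Turn a raw IOU value into a hard 0/1 target.

    The 0.5 cutoff is a hypothetical choice, not the value from the notebook.
    """
    # Any comparison with nan is False, so in this sketch a nan IOU
    # is mapped to 0.0 instead of being passed along to the loss.
    return 1.0 if iou > threshold else 0.0


targets = [threshold_iou(v) for v in (0.7, 0.3, float("nan"))]
print(targets)
```

The output is a constant 0 or 1 instead of the raw IOU, which is why the confidence predictions may end up distributed differently than with the original YOLOv2 loss.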

I am still pretty lost as to why using the IOU values directly in the calculation causes "parameters to diverge"...

Maybe this information would help someone more knowledgeable about the back-end and what that error message is implying.

[Image: yolo fine-tuned output]

POSTED BY: Alec Graves