
Mathematica port of RetinaNet/SSDMobileNet etc Object Detectors

Posted 6 years ago

Dear all,

I have been maintaining a small Zoo of Computer Vision models for object detection. I have recently added Google's SSDMobileNet and Facebook's RetinaNet.

Functionally they perform the same task as ImageContents, but provide different accuracy/speed tradeoffs. SSDMobileNet is a state-of-the-art fast object detector, and RetinaNet is a state-of-the-art high-accuracy object detector.

Other models are also available in the Zoo, including the Single Shot Detectors. These are the same as the ones in the Wolfram Neural Net Repository; that is no coincidence, as I submitted them last year and they were accepted into the repository.

The models require Mathematica v12 to run.

Youtube RetinaNet Demo

Github repository link: CognitoZoo

I hope this is of interest to anyone who is using Mathematica for Computer Vision.

Thanks, Julian Francis.

POSTED BY: Julian Francis
7 Replies
Posted 6 years ago

Hi Julian

My working (and optimistic) hypothesis is that training would start from a pre-trained Yolo net and that only the final convolution layer would be trained (this still isn't trivial; it has 11e6 parameters) - how well this works in practice is to be discovered... My thinking is that the final convolution layer is preceded by what is, in a sense, a generic image CNN feature-vector generator (i.e. one could substitute another, e.g. ResNet, to perform a similar job), so it's the final convolution layer that does most if not all of the work for the regression stages that follow it.

I'll need a week or two to find time to experiment with and understand your CZ code. Anyway, happy to take this discussion offline; I'll send an email separately.

-Steve

POSTED BY: Steve Walker
Posted 6 years ago

Hi Steve,

I've just had another look at the training methodologies for the different neural nets. They do differ in terms of their exact matching strategies, i.e. how objects are assigned to the anchor boxes (just one, or several), and also in how they define their loss functions.

I do think it is possible to build a generic framework where you can make these decisions independently of the network you are training. Some of the loss functions used can be quite complex, and you could choose exactly how closely you want to replicate each author's training strategy. I am sure it can work; exactly how closely you have to replicate their strategy to make it work really well is an open question.

Regarding your question about the single responsibility assignment, my approach has been as follows. You need to create a custom loss layer. It is just a net layer like any other, with input ports and output ports. I pass in as one input an array of the losses for every anchor box. I then pass in a mask array which selects which anchor boxes have been assigned. I simply multiply the input losses by the mask, so that only the losses at the masked positions contribute to the loss function. I then pass this out through a port called "Loss". When running NetTrain, I specify that this is the loss to be minimised (via its LossFunction option).
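A minimal sketch of such a layer in the Wolfram Language (illustrative only, not the repo's actual MaskLossLayer; the port names and the anchor count are assumptions):

```wolfram
(* Multiply per-anchor losses by a 0/1 mask, sum, and expose the
   result on a "Loss" port for NetTrain to minimise. *)
maskedLoss = NetGraph[
  <|"mask" -> ThreadingLayer[Times], "sum" -> SummationLayer[]|>,
  {{NetPort["Losses"], NetPort["Mask"]} -> "mask",
   "mask" -> "sum",
   "sum" -> NetPort["Loss"]},
  "Losses" -> {8732}, "Mask" -> {8732}]

(* Then something like NetTrain[net, data, LossFunction -> "Loss"]
   backpropagates only the losses at the masked positions. *)
```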

See ./Experimental/Training/FocusLoss.m and look at MaskLossLayer. Apologies for the code in the Experimental folder being a bit scruffy; it's mostly random ideas I have been thinking about.

Just on a slightly cautious note, I am not sure whether a few hundred images is going to be enough to train this sort of net. I think the Tiny Yolo net used about 20,000 labelled images, and COCO (used for RetinaNet) has about 300,000 images. I am pretty sure both are pre-initialised on ImageNet (a few million images), so that's your transfer learning. I think Tiny Yolo used aggressive data augmentation (not sure about the others). I do not know this for certain, but I suspect you may need a lot more labelled images.

If you would like to take this off forum, please do feel free to email me at julian.w.francis@gmail.com

Kind regards, Julian.

POSTED BY: Julian Francis
Posted 6 years ago

Hi Julian

Thanks for your helpful reply. I think there is something analogous to what you describe in the Yolo loss function, whereby one anchor box (out of 5) per grid cell is assigned to be "responsible" for a prediction; I had difficulty finding a way to implement this (I put a question to MSE about it). I'll go back and think some more about the behaviour you've described and how Yolo's loss function behaves; possibly it's also related to the focal loss idea introduced with RetinaNet.
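For what it's worth, the "responsible" anchor selection can be sketched as an argmax over intersection over union (hypothetical helpers, not from either implementation):

```wolfram
(* IoU between two boxes given as {x1, y1, x2, y2} corner coordinates *)
iou[{x1_, y1_, x2_, y2_}, {u1_, v1_, u2_, v2_}] :=
 Module[{iw, ih, inter},
  iw = Max[0, Min[x2, u2] - Max[x1, u1]];
  ih = Max[0, Min[y2, v2] - Max[y1, v1]];
  inter = iw ih;
  inter/((x2 - x1) (y2 - y1) + (u2 - u1) (v2 - v1) - inter)]

(* index of the single anchor "responsible" for a ground-truth box:
   the one with the highest IoU against it *)
responsibleAnchor[anchors_, truth_] :=
 First@Ordering[iou[#, truth] & /@ anchors, -1]
```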

The quantity of training data needed is not a question I've been able to answer yet. I was hoping that transfer learning, freezing as much as possible of the pre-trained net, and "aggressive" image augmentation would be sufficient given a few hundred original images, but I don't know yet - there is more experimentation I need to do. I have some other, slightly more speculative, ideas about synthesising more training examples by other methods that I'll eventually get to.

-Steve

POSTED BY: Steve Walker
Posted 6 years ago

Hi Steve,

Yes, you are right, RetinaNet does draw quite heavily on the FPN architecture.

Just to clarify my comment on masking loss:

RetinaNet has 8,732 anchor boxes arranged at different locations, sizes, and aspect ratios in the image.

Let's take an arbitrary anchor box, and assume that no object in the image has a high intersection over union with it. This anchor box makes 80 classification-type predictions, one per class, and it seems fairly clear what the targets should be: 0 for each of those classifier outputs. But each anchor box also makes 4 regression-type predictions that make small adjustments to the anchor box's default shape and position in order to better fit the object in the image. Here it is far less clear what the target should be, as there is no object present to match against. You might think you could just set the target to the defaults, but that's not right, as the net may be outputting something different, and this would interfere with it. I think the general answer is that whatever the net predicts for that regression box should be ignored: there should be no backpropagation of a loss signal for a regression box whose anchor box does not correspond to an object in the image. By "masking" I just mean a binary array (corresponding to the anchor boxes) passed into the regression loss function, where 1 means an object is present (please backpropagate the regression loss) and 0 means nothing is present (ignore the regression box output).
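As a toy illustration of the masking idea (made-up numbers, five anchors instead of 8,732):

```wolfram
mask = {1, 0, 0, 1, 0};                        (* 1 = anchor assigned to an object *)
regressionLosses = {0.7, 2.3, 1.1, 0.4, 5.0};  (* per-anchor regression losses *)
Total[regressionLosses mask]                   (* only assigned anchors contribute:
                                                  0.7 + 0.4 = 1.1 *)
```

The unassigned anchors' regression outputs contribute nothing to the total, so no gradient flows back through them.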

This sounds like a very interesting application. Do you have enough labelled data to train it?

I have successfully trained smaller nets before, but I've never really attempted to train any of the larger nets, so I'll take a look at the approach and see if it is implementable. I'd be inclined to try it out on a smaller net like TinyYolo first to determine feasibility.

Hope this is helpful.

Kind regards, Julian.

POSTED BY: Julian Francis
Posted 6 years ago

Hi Julian

Thanks for your reply. It was certainly non-trivial to construct the training version of Yolo2; I confess it gave me quite a lot of trouble, and I'm still not completely convinced it matches the authors' implementation. I haven't read through the RetinaNet paper closely - I'll try to do that soon - so I can't really offer a meaningful opinion on your intriguing idea of a more elegant, generalised way of handling the training of detection networks.

From a superficial reading it looks like RetinaNet draws heavily on the Feature Pyramid Net architecture and is quite different from the Yolo design, so a bit of reading and thinking on my part is needed to understand it better. By the way, by "masking loss" do you mean an intersection-over-union measure?

The applications I have in mind really are for object classes that aren't among the COCO classes - e.g. industrial components or telecommunications equipment - hence the desire to train an object detection net to specialise further.

I'll download your CZ repo, take a look at what's in it and get back to you once I've got a better understanding of RetinaNet and its forebears, hopefully that'll be quite soon.

Thanks

-Steve

POSTED BY: Steve Walker
Posted 6 years ago
POSTED BY: Steve Walker
Posted 6 years ago
POSTED BY: Julian Francis