Message Boards

Mathematica port of RetinaNet/SSDMobileNet etc Object Detectors

Posted 5 years ago

Dear all,

I have been maintaining a small Zoo of Computer Vision models for object detection. I have recently added Google's SSDMobileNet and Facebook's RetinaNet.

Functionally they perform the same task as ImageContents, but provide different accuracy/speed tradeoffs: SSDMobileNet is a state-of-the-art fast object detector, and RetinaNet is a state-of-the-art high-accuracy object detector.

There are other models available in the Zoo, including the Single Shot Detectors. These are the same as the ones available from the Wolfram Neural Net Repository; that is no coincidence, as I submitted them last year and they were accepted into the repository.

The models require Mathematica v12 to run.

YouTube RetinaNet Demo

Github repository link: CognitoZoo

I hope this is of interest to anyone who is using Mathematica for Computer Vision.

Thanks, Julian Francis.

POSTED BY: Julian Francis
7 Replies
Posted 5 years ago

Hi Julian

My working (and optimistic) hypothesis is that training would start from a pre-trained Yolo net and that only the final convolution layer would be trained (this still isn't trivial; it has 11e6 parameters) - how well this works in practice is to be discovered... My thinking was that the final convolution layer is preceded by what is, in a sense, a generic image CNN feature vector generator (i.e. one could substitute another, e.g. a ResNet, to perform a similar job), so it's the final convolution layer that does most, if not all, of the work for the regression stages that follow it.
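
In the Wolfram Language the freezing part might look something like the sketch below - a minimal sketch only, and the layer name "conv_final" is illustrative rather than the actual name in the Yolo net:

    (* freeze everything except the final convolution layer; a multiplier
       of 0 stops a layer's weights from being updated during training *)
    NetTrain[yoloNet, trainingData,
     LearningRateMultipliers -> {"conv_final" -> 1, _ -> 0}]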

I'll need a week or two to find time to experiment with and understand your CZ code. Anyway, happy to take this discussion offline - I'll send an email separately.

-Steve

POSTED BY: Steve Walker
Posted 5 years ago

Hi Steve,

I've just had another look at the training methodologies for the different neural nets. They do differ in their exact matching strategies, i.e. how objects are assigned to the anchor boxes (just one or several), and also in how they define their loss functions.

I do think it is possible to build a generic framework where you can make these decisions independently of the network you are training. Some of the loss functions used can be quite complex, and you could choose exactly how closely you want to replicate each author's training strategy. I am sure it can be made to work; exactly how closely you have to replicate their training strategy to make it work really well is an open question.

Regarding your question about the single responsibility assignment, my approach has been as follows. You need to create a custom loss layer. It is just a net layer like any other: it has input ports and output ports. I pass in as one input an array of all the losses for every anchor box, and as another input a mask array that selects which anchor boxes we have assigned. I simply multiply the input losses by the mask, so that only the losses at the assigned positions contribute to the loss function. I then pass this out through a port called "Loss". When running NetTrain I specify that this is the loss function that should be minimised (using its LossFunction parameter).
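
In outline it looks something like this - a minimal sketch rather than the exact code in the repository, and the port names are illustrative:

    (* multiply per-anchor losses by a binary mask, then sum the survivors *)
    maskLossNet = NetGraph[
      <|"mask" -> ThreadingLayer[Times],  (* elementwise losses * mask *)
        "sum" -> SummationLayer[]|>,      (* aggregate into a scalar loss *)
      {{NetPort["Losses"], NetPort["Mask"]} -> "mask" -> "sum" -> NetPort["Loss"]}]

    (* mark the "Loss" port as the quantity NetTrain should minimise *)
    NetTrain[net, trainingData, LossFunction -> "Loss"]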

See ./Experimental/Training/FocusLoss.m and look at MaskLossLayer. Apologies for the code in the Experimental folder being a bit scruffy; it's mostly random ideas I have been thinking about.

Just on a slightly cautious note, I am not sure whether a few hundred images is going to be enough to train these sorts of nets. I think the Tiny Yolo net used about 20,000 labelled images, and COCO (for RetinaNet) used about 300,000 images. I am pretty sure they are both pre-initialised on ImageNet (a few million images), so that's your transfer learning. I think TinyYolo used aggressive data augmentation (I'm not sure about the others). I do not know this for certain, but I suspect you may need a lot more labelled images.

If you would like to take this off forum, please do feel free to email me at julian.w.francis@gmail.com

Kind regards, Julian.

POSTED BY: Julian Francis
Posted 5 years ago

Hi Julian

Thanks for your helpful reply. I think there is something analogous to what you describe in the Yolo loss function, whereby one anchor box (out of 5) per grid cell is assigned to be "responsible" for a prediction; I had difficulty finding a way to implement this (I put a question to MSE about it). I'll go back and think some more about the behaviour you've described and how Yolo's loss function behaves - possibly it's also related to the focal loss idea introduced with RetinaNet.

The quantity of training data needed is not a question I've been able to answer yet. I was hoping that transfer learning and freezing as much as possible of the pre-trained net, together with "aggressive" image augmentation, would be sufficient given a few hundred original images, but I don't know yet - there is more experimentation I need to do. I also have some slightly more speculative ideas about synthesising more training examples by other methods that I'll eventually get to.

-Steve

POSTED BY: Steve Walker
Posted 5 years ago

Hi Steve,

Yes, you are right: RetinaNet does draw quite heavily on the FPN architecture.

Just to clarify my comment on masking loss:

RetinaNet has 8,732 anchor boxes arranged at different locations, sizes, and aspect ratios in the image.

Let's take an arbitrary anchor box, and assume that for this anchor box there are no objects in the image that have a high intersection over union with it. This anchor box makes 80 classification-type predictions, one for each class, and it seems fairly clear what the target should be, i.e. 0 for each of those classifier outputs. But each anchor box also makes 4 regression-type predictions, which make small adjustments to the anchor box's default shape and position in order to better fit the object in the image. Here it is far less clear what the target should be, as there is no object present to try to match against. You might think you could just set it to the defaults, but that's not right, as the net may be outputting something different, and this would interfere with it.

I think the general answer is that whatever the net is predicting for that regression box should be ignored, so there should be no back-propagation of a loss signal for a regression box whose anchor box does not correspond to an object in the image. By "masking" I just mean that a binary array (corresponding to the anchor boxes) should be passed into the regression loss function, where 1 means "object present, please backpropagate the regression loss" and 0 means "nothing present, ignore the regression box output".
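
To make the masking concrete, here is one way such a mask could be derived - a hedged sketch, where the anchors-by-objects IoU matrix and the 0.5 threshold are illustrative assumptions:

    (* iouMatrix: an (anchors x objects) matrix of intersection-over-union
       values; an anchor is assigned (1) if it overlaps some object above
       the threshold, otherwise ignored (0) *)
    assignAnchors[iouMatrix_, threshold_ : 0.5] :=
     Map[Boole[Max[#] >= threshold] &, iouMatrix]

    assignAnchors[{{0.1, 0.7}, {0.2, 0.3}}]  (* -> {1, 0} *)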

This sounds like a very interesting application. Do you have enough labelled data to train it?

I have successfully trained smaller nets before, but I've never really attempted any of the larger nets. I'll take a look at the approach and see if it is implementable; I'd be inclined to try it out on a smaller net like TinyYolo first to determine feasibility.

Hope this is helpful.

Kind regards, Julian.

POSTED BY: Julian Francis
Posted 5 years ago

Hi Julian

Thanks for your reply. It was certainly non-trivial to construct the training version of Yolo2; I confess it gave me quite a lot of trouble, and I'm still not completely convinced it matches the authors' implementation. I haven't read through the RetinaNet paper closely - I'll try to do that soon - so I can't really offer a meaningful opinion on your intriguing idea of a more elegant, generalised way of handling the training of detection networks.

From a superficial reading it looks like RetinaNet draws heavily on the Feature Pyramid Net architecture and is quite different from the Yolo design, so a bit of reading and thinking on my part is needed to understand it better. By the way, by "masking loss" do you mean an intersection over union measure?

The applications I have in mind really are for object classes that aren't in the COCO classes - e.g. industrial components or telecommunications equipment - hence the desire to train an object detection net to specialise further.

I'll download your CZ repo, take a look at what's in it, and get back to you once I've got a better understanding of RetinaNet and its forebears; hopefully that'll be quite soon.

Thanks

-Steve

POSTED BY: Steve Walker
Posted 5 years ago

Hi Julian

I was very pleased to see your post - impressive work. Have you looked at creating a trainable version of RetinaNet (or do you even have one already)?

A colleague and I worked on a trainable version of Yolo2 earlier in the year, based on the construction notebook version (very nicely prepared) available on the WRI repo, but it was quite a lot of work, and it needs yet more work before it's usable by others. The motivation was to be able to train it for other specialised detection tasks, as well as to experiment with the net architecture itself - and similarly with RetinaNet.

-Steve

POSTED BY: Steve Walker
Posted 5 years ago

Hi Steve,

Thanks for your interest. No, I'm sorry I don't have code for training a RetinaNet.

In the DataConverters folder there is some code for reading and parsing COCO files.

The Experimental folder has code that would be useful. For example, there is code implementing the alpha-weighted focal loss from RetinaNet. I also have other neural nets that use a mask-based loss backpropagation, which you would need to ensure that only the losses for assigned objects are backpropagated through the bounding box regression layer.
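
For reference, the alpha-weighted focal loss for a positive example is FL(p) = -α (1 - p)^γ log(p); a one-line illustrative sketch (not the FocusLoss.m code, with α = 0.25 and γ = 2 as in the RetinaNet paper) could be:

    (* elementwise focal loss factor applied to predicted probabilities *)
    focalLossLayer = ElementwiseLayer[-0.25 (1 - #)^2 Log[#] &]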

There is other code in Experimental, for other neural nets, which does things like mapping bounding boxes back onto an array - something you would need in order to build the targets for the neural net.

The code in the Experimental folders is a little bit scruffy (I'd be happy to tidy some of it up if it's of interest). So I think all the functionality that would be needed is there, but it would all have to be assembled and made to fit the target output specifically for RetinaNet. I am sure it can be done, but it is not a completely trivial exercise.

Thinking about it, I suspect that the really smart way of doing it would be not to try to train all the different multiple-resolution structures, but just to train straight off the two output ports, i.e. the Classes and Boxes ports. It seems too easy, but I can't think of a reason at the moment why it would not work. That would save a lot of the trouble of matching the RetinaNet structure. It might also have the benefit that the methodology would be portable to training other object detection nets. You'd still need to write that masking loss function (but again, that could be portable across net architectures).
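
Schematically, I mean something like the following - a sketch only, where the detector variable, the target ports, and the loss choices are placeholder assumptions rather than a tested recipe:

    (* attach losses directly to the detector's "Classes" and "Boxes" ports *)
    trainNet = NetGraph[
      <|"detector" -> retinaNet,   (* the pre-built detector *)
        "clsLoss" -> CrossEntropyLossLayer["Probabilities"],
        "boxLoss" -> MeanSquaredLossLayer[]|>,
      {NetPort[{"detector", "Classes"}] -> NetPort[{"clsLoss", "Input"}],
       NetPort["ClassTargets"] -> NetPort[{"clsLoss", "Target"}],
       NetPort[{"detector", "Boxes"}] -> NetPort[{"boxLoss", "Input"}],
       NetPort["BoxTargets"] -> NetPort[{"boxLoss", "Target"}],
       "clsLoss" -> NetPort["ClassLoss"],
       "boxLoss" -> NetPort["BoxLoss"]}]

    (* minimise both losses together *)
    NetTrain[trainNet, trainingData, LossFunction -> {"ClassLoss", "BoxLoss"}]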

I started writing this thinking it would be a significant amount of work, but if it is possible to ignore all the internal structure, then as a bonus that approach would work for all the nets. An interesting thought.

There is still the issue of choosing a sensible learning rate and learning schedule, which can be quite a tricky area.

I'd be very happy to help if you'd like to take it further. Do you have specific application areas in mind?

With kind regards, Julian.

POSTED BY: Julian Francis