Mathematica port of RetinaNet/SSDMobileNet etc Object Detectors

Posted 6 years ago
POSTED BY: Julian Francis
7 Replies
Posted 6 years ago

Hi Julian

My working (and optimistic) hypothesis is that training would start from a pre-trained Yolo net and that only the final convolution layer would be trained (this still isn't trivial - it has 11e6 parameters) - how well this works in practice is to be discovered... My thinking was that the final convolution layer is preceded by what is, in a sense, a generic image CNN feature-vector generator (i.e. one could substitute another, e.g. a ResNet, to perform a similar job), so it's the final convolution layer that does most, if not all, of the work for the regression stages that follow it.
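The freezing part at least looks straightforward in WL - a minimal sketch, assuming a training net `trainNet` in which the final convolution layer is named "finalConv" (a placeholder name on my part, not the layer's real name):

```
(* Freeze everything except the final convolution layer; "finalConv" stands
   in for whatever the layer is actually called in the training net. *)
NetTrain[trainNet, trainingData,
 LearningRateMultipliers -> {"finalConv" -> 1, _ -> None}]
```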

I'll need a week or two to find time to experiment with and understand your CZ code. Anyway, happy to take this discussion offline - I'll send an email separately.

-Steve

POSTED BY: Steve Walker
Posted 6 years ago
POSTED BY: Julian Francis
Posted 6 years ago

Hi Julian

Thanks for your helpful reply. I think there is something analogous to what you describe in the Yolo loss function, whereby one anchor box (out of 5) per grid cell is assigned to be "responsible" for a prediction; I had difficulty finding a way to implement this (I put a question to MSE about it). I'll go back and think some more about the behaviour you've described and how Yolo's loss function behaves; possibly it's also related to the focal loss idea introduced with RetinaNet.
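One possible way to sidestep the implementation difficulty - just a sketch of the idea, not anything from the paper's code - is to resolve the "responsible" assignment when building the training targets rather than inside the loss:

```
(* Pick the "responsible" anchor for a grid cell as the one whose prior has
   the highest IoU with the ground-truth box, at target-construction time. *)
responsibleAnchor[ious : {__?NumericQ}] := First[Ordering[ious, -1]]

responsibleAnchor[{0.1, 0.7, 0.3, 0.2, 0.6}]  (* -> 2 *)
```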

The quantity of training data needed is not a question I've been able to answer yet. I was hoping transfer learning and freezing as much as possible of the pre-trained net, together with "aggressive" image augmentation, would be sufficient given a few hundred original images, but I don't know yet - there is more experimentation I need to do. I have some other, slightly more speculative ideas about synthesising more training examples by other methods that I'll eventually get to.
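(By "aggressive" augmentation I mean things like random flips and crops; the wrinkle for detection is that the boxes have to be transformed along with the image. A toy sketch of the flip case, with boxes as {xmin, ymin, xmax, ymax} in fractional image coordinates - my own convention, nothing from the repo:)

```
(* Horizontally flip an image together with its boxes, with probability 1/2. *)
flipPair[img_Image, boxes_List] :=
 If[RandomReal[] < 0.5,
  {ImageReflect[img, Left -> Right],
   {1 - #[[3]], #[[2]], 1 - #[[1]], #[[4]]} & /@ boxes},
  {img, boxes}]
```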

-Steve

POSTED BY: Steve Walker
Posted 6 years ago

Hi Steve,

Yes, you are right: RetinaNet does draw quite heavily from the FPN architecture.

Just to clarify my comment on masking loss:

RetinaNet has 8,732 anchor boxes arranged at different locations, sizes, and aspect ratios in the image.

Let's take an arbitrary anchor box, and assume that for this anchor box there are no objects in the image that have a high intersection over union with it. This anchor box makes 80 classification-type predictions, one for each class, and it seems fairly clear what the targets should be: 0 for each of those classifier outputs. But each anchor box also makes 4 regression-type predictions, which make small adjustments to the anchor box's default shape and position in order to better fit the object in the image. Here it is far less clear what the target should be, as there is no object present to match against. You might think you could just set the target to the anchor's defaults, but that's not right, as the net may be outputting something different, and training towards the defaults would interfere with it.

I think the general answer is that whatever the net is predicting for that regression box should be ignored, so there should be no backpropagation of a loss signal for a regression box whose anchor box does not correspond to an object in the image. By "masking" I just mean that a binary array (corresponding to the anchor boxes) should be passed into the regression loss function, where 1 means an object is present (please backpropagate the regression loss) and 0 means nothing is present (ignore the regression box output).
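For concreteness, a minimal sketch of such a mask-based regression loss as a NetGraph - the port names, the anchor count, and the plain L1 penalty here are illustrative choices of mine, not code from the repo:

```
nAnchors = 8732;
maskedBoxLoss = NetGraph[
  <|
   "diff" -> ThreadingLayer[Subtract],         (* prediction - target *)
   "abs" -> ElementwiseLayer[Abs],             (* L1 penalty per offset *)
   "perAnchor" -> AggregationLayer[Total, 2],  (* sum the 4 box offsets *)
   "mask" -> ThreadingLayer[Times],            (* zero out unmatched anchors *)
   "total" -> SummationLayer[]
   |>,
  {
   {NetPort["Input"], NetPort["Target"]} -> "diff" -> "abs" -> "perAnchor",
   {"perAnchor", NetPort["Mask"]} -> "mask" -> "total" -> NetPort["Loss"]
   },
  "Input" -> {nAnchors, 4}, "Target" -> {nAnchors, 4}, "Mask" -> {nAnchors}]
```

Because the mask multiplies the per-anchor loss by 0, no gradient flows back through the regression outputs of unmatched anchors, which is exactly the "ignore" behaviour described above.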

This sounds like a very interesting application. Do you have enough labelled data to train it?

I have successfully trained smaller nets before, but I've never really attempted to train any of the larger nets. I'll take a look at the approach and see if it is implementable; I'd be inclined to try it out on a smaller net like TinyYolo first to determine feasibility.

Hope this is helpful.

Kind regards, Julian.

POSTED BY: Julian Francis
Posted 6 years ago

Hi Julian

Thanks for your reply. It was certainly non-trivial to construct the training version of Yolo2; I confess it gave me quite a lot of trouble, and I'm still not completely convinced it matches the authors' implementation. I haven't read through the RetinaNet paper closely - I'll try and do that soon - so I can't really offer a meaningful opinion on your intriguing idea of a more elegant, generalised way of handling the training of detection networks.

From a superficial reading it looks like RetinaNet draws heavily from the Feature Pyramid Network architecture and is quite different from the Yolo design, so a bit of reading and thinking on my part is needed to understand it better. By the way, by "masking loss" do you mean an intersection-over-union measure?

The applications I have in mind really are for object classes that aren't in the COCO classes - e.g. industrial components or telecommunications equipment - hence the desire to train an object detection net to specialise further.

I'll download your CZ repo, take a look at what's in it, and get back to you once I've got a better understanding of RetinaNet and its forebears - hopefully that'll be quite soon.

Thanks

-Steve

POSTED BY: Steve Walker
Posted 6 years ago
POSTED BY: Steve Walker
Posted 6 years ago

Hi Steve,

Thanks for your interest. No, I'm sorry I don't have code for training a RetinaNet.

In the DataConverters folder there is some code for reading and parsing COCO files.
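(Parsing the annotation JSON itself is short in WL; the following is just an illustration of the idea, not the DataConverters code, and the file name is only an example:)

```
(* Import COCO-style annotations and group them by image id. *)
coco = Import["annotations/instances_val2017.json", "RawJSON"];
byImage = GroupBy[coco["annotations"], Key["image_id"]];
```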

The Experimental folder has code that would be useful. For example, there is code to implement the alpha-weighted focal loss from RetinaNet. And I do have other neural nets that use a mask-based loss backpropagation, which you would need to ensure that only the loss for detected objects is backpropagated through the bounding box regression layer.
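The focal loss itself is short; here is a sketch of the alpha-weighted form from the RetinaNet paper, with the paper's defaults alpha = 0.25 and gamma = 2 (the port names and the 80-class shape are my illustrative choices, not the repo's code):

```
focalLoss = NetGraph[
  <|
   (* FL = -alpha t (1-p)^gamma Log[p] - (1-alpha)(1-t) p^gamma Log[1-p],
      with alpha = 0.25 and gamma = 2 inlined below. *)
   "fl" -> ThreadingLayer[
     Function[{p, t},
      -0.25*t*(1 - p)^2*Log[p] - 0.75*(1 - t)*p^2*Log[1 - p]]],
   "total" -> SummationLayer[]
   |>,
  {{NetPort["Probs"], NetPort["Target"]} -> "fl" -> "total" -> NetPort["Loss"]},
  "Probs" -> {80}, "Target" -> {80}]
```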

There is other code in Experimental, for other neural nets, that does things like map bounding boxes back onto an array, which you would need in order to build the targets for the neural net.
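The matching step underneath all of that is just intersection over union between anchor boxes and ground-truth boxes; a self-contained sketch (with boxes as {xmin, ymin, xmax, ymax}, my convention here):

```
(* Intersection over union of two axis-aligned boxes. *)
iou[{x1_, y1_, x2_, y2_}, {u1_, v1_, u2_, v2_}] :=
 Module[{iw, ih, inter, union},
  iw = Max[0., Min[x2, u2] - Max[x1, u1]];
  ih = Max[0., Min[y2, v2] - Max[y1, v1]];
  inter = iw*ih;
  union = (x2 - x1)*(y2 - y1) + (u2 - u1)*(v2 - v1) - inter;
  If[union > 0., inter/union, 0.]]

iou[{0, 0, 2, 2}, {1, 1, 3, 3}]  (* 1/7, about 0.143 *)
```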

The code in the Experimental folder is a little bit scruffy (I'd be happy to tidy some of it up if it's of interest). So I think all the functionality that would be needed is there, but it would need to be assembled together and made to fit the target output specifically for RetinaNet. I am sure it can be done, but it is not a completely trivial exercise.

Thinking about it, I suspect that the really smart way of doing it would be not to try to train all the different multiple-resolution structures, but just to train straight off the 2 output ports, i.e. the Classes and Boxes ports. It seems too easy, but I can't think of a reason at the moment why it would not work. That would save a lot of the trouble of matching the RetinaNet structure. It might also have the benefit that the methodology could be portable to training other object detection nets. You'd still need to write that masking loss function (but again, that could be portable across net architectures).
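In outline, that wiring might look like the following sketch. It assumes the detector `retina` really does expose "Classes" and "Boxes" output ports, and it reuses the focalLoss and maskedBoxLoss graphs sketched above; the target and mask port names are mine:

```
(* Wrap the detector with the two losses and train off its output ports;
   the training data must then supply ClassTarget, BoxTarget and Mask. *)
trainNet = NetGraph[
  <|"net" -> retina, "cls" -> focalLoss, "box" -> maskedBoxLoss|>,
  {
   NetPort[{"net", "Classes"}] -> NetPort[{"cls", "Probs"}],
   NetPort["ClassTarget"] -> NetPort[{"cls", "Target"}],
   NetPort[{"net", "Boxes"}] -> NetPort[{"box", "Input"}],
   NetPort["BoxTarget"] -> NetPort[{"box", "Target"}],
   NetPort["Mask"] -> NetPort[{"box", "Mask"}],
   "cls" -> NetPort["ClassLoss"],
   "box" -> NetPort["BoxLoss"]
   }];

NetTrain[trainNet, data, LossFunction -> {"ClassLoss", "BoxLoss"}]
```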

I think I started writing this thinking it would be a significant amount of work, but if it is possible to ignore all the internal structure, then as a bonus that approach would work for all the nets. Interesting thought.

There is still the issue of choosing a sensible learning rate and learning schedule, which can be quite a tricky area.

I'd be very happy to help if you'd like to take it further. Do you have specific application areas in mind?

With kind regards, Julian.

POSTED BY: Julian Francis