
Automatically sliding a conv net over a larger image

How can I control the step size of the following conv net as it slides over a larger image?

See also: https://mathematica.stackexchange.com/questions/144060/sliding-fullyconvolutional-net-over-larger-images/148033

As a toy example, I'd like to slide a digit classifier trained on 28x28 images over a larger image to classify each 28x28 neighborhood. The network below is LeNet with the linear layers replaced by convolution layers.

trainingData = ResourceData["MNIST", "TrainingData"];
testData = ResourceData["MNIST", "TestData"];

lenetModel = 
  NetModel["LeNet Trained on MNIST Data", 
   "UninitializedEvaluationNet"];

(* replace the fully connected head of LeNet with convolutional equivalents *)
newlenet = NetExtract[lenetModel, All];
newlenet[[7]] = ConvolutionLayer[500, {4, 4}]; (* plays the role of the first linear layer *)
newlenet[[8]] = ElementwiseLayer[Ramp];
newlenet[[9]] = ConvolutionLayer[10, 1]; (* 1x1 convolution in place of the final linear layer *)
newlenet[[10]] = SoftmaxLayer[1]; (* softmax over the channel dimension *)
newlenet[[11]] = PartLayer[{All, 1, 1}]; (* drop the trailing 1x1 spatial dimensions *)

newlenet = 
 NetChain[newlenet, 
  "Input" -> 
   NetEncoder[{"Image", {28, 28}, ColorSpace -> "Grayscale"}]]
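
As a quick shape check (just a sketch; I initialize a throwaway copy, since the net above is still untrained), a 28x28 image should map to a length-10 probability vector:

(* an initialized copy maps a 28x28 grayscale image to 10 class probabilities *)
NetInitialize[newlenet][RandomImage[1, {28, 28}]] // Dimensions
(* expected: {10} *)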

Now train it:

newtd = First@# -> UnitVector[10, Last@# + 1] & /@ trainingData;
newvd = First@# -> UnitVector[10, Last@# + 1] & /@ testData;

ng = NetGraph[
  <|"inference" -> newlenet,
   "loss" -> CrossEntropyLossLayer["Probabilities", "Input" -> 10]
   |>,
  {
   "inference" -> NetPort["loss", "Input"],
   NetPort["Target"] -> NetPort["loss", "Target"]
   }
  ]
tnew = NetTrain[ng, newtd, ValidationSet -> newvd, 
  TargetDevice -> "GPU"]

Now remove the dimension information from the layers (removeInputInformation comes from the Stack Exchange post linked above):

removeInputInformation[layer_ConvolutionLayer] := 
 With[{k = NetExtract[layer, "OutputChannels"], 
   kernelSize = NetExtract[layer, "KernelSize"], 
   weights = NetExtract[layer, "Weights"], 
   biases = NetExtract[layer, "Biases"], 
   padding = NetExtract[layer, "PaddingSize"], 
   stride = NetExtract[layer, "Stride"], 
   dilation = NetExtract[layer, "Dilation"]}, 
  ConvolutionLayer[k, kernelSize, "Weights" -> weights, 
   "Biases" -> biases, "PaddingSize" -> padding, "Stride" -> stride, 
   "Dilation" -> dilation]]

removeInputInformation[layer_PoolingLayer] := 
 With[{f = NetExtract[layer, "Function"], 
   kernelSize = NetExtract[layer, "KernelSize"], 
   padding = NetExtract[layer, "PaddingSize"], 
   stride = NetExtract[layer, "Stride"]}, 
  PoolingLayer[kernelSize, stride, "PaddingSize" -> padding, 
   "Function" -> f]]

removeInputInformation[layer_ElementwiseLayer] := 
 With[{f = NetExtract[layer, "Function"]}, ElementwiseLayer[f]]

removeInputInformation[x_] := x

tmp = NetExtract[NetExtract[tnew, "inference"], All];
n3 = removeInputInformation /@ tmp[[1 ;; -3]];
AppendTo[n3, SoftmaxLayer[1]];
n3 = NetChain@n3;

The network n3 now slides over any larger input. However, it seems to slide with steps of 4. How could I make it take steps of 1 instead?

In[358]:= n3[RandomReal[1, {1, 28*10, 28}]] // Dimensions

Out[358]= {10, 64, 1}

In[359]:= BlockMap[Length, Range[28*10], 28, 4] // Length

Out[359]= 64
POSTED BY: Matthias Odisio
11 Replies

Hi Matthias,

The stride of 4 comes from the pooling layers:

In[49]:= Map[NetExtract[n3, {#, "Stride"}] &, {3, 6}]
Out[49]= {{2, 2}, {2, 2}}

(then on each dimension, there is an "implicit" stride of 2 x 2 = 4)
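
The effective step is just the product of these per-layer strides; a quick check, reusing the strides extracted above:

Times @@ Map[First@NetExtract[n3, {#, "Stride"}] &, {3, 6}]
(* 4 *)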

You can easily make this stride bigger (by a multiplicative factor), for example by setting a stride > 1 in the top convolution layer.

But reducing the stride is a bit awkward. You could get a stride of 1 by setting the stride to 1 in both pooling layers (that is to say, by effectively removing them...). But then the model is not the same: if you remove (or change) a pooling layer, you invalidate the weights learned after it.

So the only solution I see, if you really want a stride of 1, is to run the same network 16 times (!) on shifted inputs and interleave the results to reconstitute the output, as sketched below. You can save some computation by not recomputing what comes before the first pooling layer (i.e. the first convolution and its nonlinearity). If you have fixed-size images, there is a way to put everything into a single network, sharing layers with NetInsertSharedArrays and using PartLayer to shift the image representations where needed.
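
Here is a rough sketch of that shift-and-interleave idea (denseClassMap is just a made-up helper name). It assumes the fully convolutional n3 from the original post, a 2-D numeric array arr (e.g. ImageData of a grayscale image), and that all 16 shifted crops produce outputs of the same spatial size:

(* run the net on the 16 shifted crops and interleave the resulting class maps *)
denseClassMap[net_, arr_] := Module[{outs, h, w},
  (* one output per offset {dy, dx}, with dy and dx in 0..3 *)
  outs = Table[Normal@net[{arr[[dy + 1 ;;, dx + 1 ;;]]}], {dy, 0, 3}, {dx, 0, 3}];
  {h, w} = Rest@Dimensions[outs[[1, 1]]]; (* spatial size of one output *)
  (* dense position {4 k + dy, 4 l + dx} comes from crop {dy, dx} at index {k, l};
     the result has dimensions {4 h, 4 w, 10}, class probabilities last *)
  Table[
   outs[[Mod[i, 4] + 1, Mod[j, 4] + 1, All, Quotient[i, 4] + 1, Quotient[j, 4] + 1]],
   {i, 0, 4 h - 1}, {j, 0, 4 w - 1}]]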

We should probably advertise

NetReplacePart[net, "Input" -> Automatic]

For LeNet you can do

NetReplacePart[
  NetDrop[
    NetModel["LeNet Trained on MNIST Data", "EvaluationNet"],
    -5],
  "Input" -> Automatic]

and get

(image of the resulting net, displayed with no fixed input size)

Thanks Jerome. So it's not realistically feasible to reduce the stride.

I hope this related topic will also interest you: what about sliding this denoising autoencoder?

size = 157;

n1 = 32;
k = 5;

conv2[n_] := 
  NetChain[{ConvolutionLayer[n, k, "Stride" -> 2], 
    BatchNormalizationLayer[], ElementwiseLayer["ReLU"], 
    DropoutLayer[], ConvolutionLayer[n, k, "Stride" -> 2], 
    BatchNormalizationLayer[], ElementwiseLayer["ReLU"]}];

deconv2[n_] := 
  NetChain[{DeconvolutionLayer[n, k, "Stride" -> 2], 
    BatchNormalizationLayer[], ElementwiseLayer["ReLU"], 
    DropoutLayer[], DeconvolutionLayer[n/2, k, "Stride" -> 2], 
    BatchNormalizationLayer[], ElementwiseLayer["SoftSign"]}];

sum[] := NetChain[{TotalLayer["Inputs" -> 2]}];

(* applies a learnable power to each pixel: Exp[w Log[x]] == x^w *)
constantPowerLayer[] := NetChain[{
   ElementwiseLayer[Log@Clip[#, {$MachineEpsilon, 1}] &],
   ConvolutionLayer[1, 1, "Biases" -> None, "Weights" -> {{{{1}}}}],
   ElementwiseLayer[Exp]}]

ddae = NetGraph[
  <|
   "bugworkaround" -> ElementwiseLayer[# &],
   "c12" -> conv2[n1],
   "c34" -> conv2[2*n1],

   "d12" -> deconv2[2*n1],
   "d34" -> 
    NetChain[{DeconvolutionLayer[n1, k, "Stride" -> 2], 
      BatchNormalizationLayer[], ElementwiseLayer["ReLU"], 
      DeconvolutionLayer[1, k, "Stride" -> 2], 
      BatchNormalizationLayer[], ElementwiseLayer["SoftSign"]}],

   "sum1" -> sum[],
   "sum2" -> NetChain[{sum[], constantPowerLayer[]}],

   "loss" -> MeanSquaredLossLayer[]
   |>,
  {
   "bugworkaround" -> 
    "c12" -> 
     "c34" -> 
      "d12" -> "sum1" -> "d34" -> "sum2" -> NetPort["loss", "Input"],
   "bugworkaround" -> "sum2",
   NetPort["Noisy"] -> "bugworkaround",
   "c12" -> "sum1",
   NetPort["Target"] -> NetPort["loss", "Target"]
   },
  "Noisy" -> 
   NetEncoder[{"Image", {size, size}, ColorSpace -> "Grayscale"}],
  "Target" -> 
   NetEncoder[{"Image", {size, size}, ColorSpace -> "Grayscale"}]
  ]

trained = NetTake[NetInitialize@ddae, {"bugworkaround", "sum2"}]

Now I make the input dimensions automatic:

n3 = NetReplacePart[trained, "Noisy" -> Automatic];

This new network works fine when given the original dimensions, but fails with larger input dimensions. Any idea how to fix this problem?

In[142]:= n3[RandomReal[1, {1, 157, 157}]] // Dimensions

Out[142]= {1, 157, 157}

In[143]:= n3[RandomReal[1, {1, 1570, 1570}]] // Dimensions

During evaluation of In[143]:= NetGraph::tyfail1: Inferred inconsistent value for output size of layer 4 of layer "d34".

Out[143]= {}
POSTED BY: Matthias Odisio

Ah yes. Thanks! This is used in my follow-up question.

POSTED BY: Matthias Odisio

Cool, auto-encoder with a U-shape!

So here the problem is different. There are constraints on the input size so that the deconvolutions can reproduce the same size: there cannot be any border effect.

For instance, if you give an input of length 6 to a kernel of 5 with stride 2, you lose one input element; that's what I call a border effect. Our framework allows this. But here you really need to reconstitute the same size after the deconvolutions, which you cannot do if you throw away some input features.

So your input has to be of size 61 + 16 n, where n is a nonnegative integer.

And you can see that the construction of "ddae" itself fails if you use a size that does not satisfy this constraint (such as 1570). It's not a problem caused by changing the dimensions.

So try it with images whose size is congruent to 157 modulo 16 (and not smaller than 61)!
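
For example, a quick check (assuming the size-free n3 from your last message; 61 + 16*5 = 141 satisfies the constraint):

(* a size of the form 61 + 16 n should pass shape inference *)
n3[RandomReal[1, {1, 61 + 16*5, 61 + 16*5}]] // Dimensions
(* expected: {1, 141, 141} *)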

Thanks for this elaboration, Jerome.

Are those constraints imposed by an underlying third-party implementation? It feels like a bug that ConvolutionLayer and DeconvolutionLayer do not interplay well. I know the documentation does not claim otherwise, but from my end-user's perspective those so-called "border effects" should be taken care of by the framework.

POSTED BY: Matthias Odisio

Indeed, for this auto-encoder use case, some "smart" padding applied when needed could solve the issue.

Support for more forms of padding is in the pipeline for improving the WL framework. We already unlocked some things around padding in the last version. There is a good chance we will offer automatic padding to a constraint of the form "m + k * n" in the next version, or another user-friendly solution for efficient support of multiple dynamic dimensions.

In the meantime, you can use the cheap solution of padding input images to a size of the form 61 + 16 n. It can be done using PaddingLayer, for example; you need to prepend this layer to the network for a given image size, as sketched below. It's a bit awkward, but you will have no overhead compared to the current situation, where the size inference and the unrolling of the net are done at top level each time you apply the network. Again, we will improve how things work for multiple variable dimensions in future versions.
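
A minimal sketch of that padding workaround (validSize and padToValid are made-up names; it assumes the size-free n3 from above applied to a 2-D array arr, padding on the bottom/right with zeros):

(* round a length up to the nearest valid size 61 + 16 n *)
validSize[s_] := 61 + 16 Max[0, Ceiling[(s - 61)/16]]

(* pad the array to a valid size, build the chain for that size, and apply it *)
padToValid[net_, arr_] := Module[{h, w, ph, pw},
  {h, w} = Dimensions[arr];
  {ph, pw} = validSize /@ {h, w};
  NetChain[
    {PaddingLayer[{{0, 0}, {0, ph - h}, {0, pw - w}}], net},
    "Input" -> {1, h, w}][{arr}]]

The reconstruction over the padded border can then be discarded, e.g. with result[[All, ;; h, ;; w]].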

Thanks, this is an acceptable workaround.

May I ask how you derive this formula, (61+16*n)?

By the way, a set of parentheses is missing in ref/ConvolutionLayer's notes for the output size formula. The Property example gives the correct result.

And, I take good note that "the future will be better."

POSTED BY: Matthias Odisio

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

POSTED BY: Moderation Team

May I ask how you derive this formula, (61+16*n)?

The output length of a convolution or pooling layer, as a function of the input length for a given kernel size and stride, is:

(* exact when there is no border effect; in general the output length is Floor[(inputLength - kernel)/stride] + 1 *)
layerOutputLength[kernel_, stride_][inputLength_] := (inputLength - kernel)/stride + 1;

Inverting this gives the input length as a function of the output length:

layerInputLength[kernel_, stride_][outputLength_] := stride * (outputLength - 1) + kernel;

There are 4 layers with kernel size 5 and stride 2 on the downsampling path of your auto-encoder. So the input size corresponding to a length of 1 at the deepest level (where the image dimension is smallest) is obtained by applying layerInputLength[5, 2] four times, starting from outputLength = 1:

netInputLength[outputLength_] := Nest[
    layerInputLength[5, 2],
    outputLength,
    4
];

netInputLength[1]
Out[4]= 61

This is the minimal input length for which nothing is lost and the length is 1 in the narrowest part of the network.

Then the "global stride" is just the product of all the strides, so 2^4 = 16.

This value of the global stride can be checked by looking at how much bigger the input length must be to make the narrowest part one unit longer:

netInputLength[2] - netInputLength[1]
netInputLength[3] - netInputLength[2]
netInputLength[4] - netInputLength[3]
Out[5]= 16
Out[6]= 16
Out[7]= 16

You can check all these equations by drawing what happens on a piece of paper =)

Thanks for taking the time to elaborate these details. Merci !

POSTED BY: Matthias Odisio