But you also cut out a number of convolution layers; they go up to
conv5_3.
Yes, that's because the deeper convolutional layers start to behave like the final linear layers: too much information is discarded. Try chopping the net at one of those and you will see very poor quality in the final result. On the other hand, very shallow layers only capture very basic features of the original style (like color). For a single-layer simplified version like this, mid-to-deep layers are the sweet spot.
Another question: why did you run the outputs of extractFeatures
through gramMatrix before computing the losses? Did you try it without
gramMatrix and get poor results?
I can imagine that the internal linear relationships in the feature
matrix may be more important than the precise values, but it's not so
obvious that's the case. What led you to that?
Later: Oh, I get it: the pics are not the same size, so this is the only
comparison possible.
Just to clarify: this algorithm was not invented by me! It was published in this paper, which started an interesting line of research on these methods.
About your questions:
1 - If you look at the content loss, matching the features directly instead of their gram matrices will match content. So removing the gram matrix from play will result in the algorithm trying to match the content of both targets, i.e. you will effectively use two content losses and no style loss at all.
2 - As you also noticed, the features themselves contain a lot of spatial information, i.e. what the original picture looks like at particular pixel locations. If we want to capture the general style, we are not interested in that. The gram matrix is an effective way to discard spatial information and keep only the correlations between channels. As you observed, this also allows content and style images of different sizes to be used, but that's just a nice side effect.
3 - Beyond the observation that spatial information is discarded, the exact reason why gram matrices can effectively encode style information was a mystery for a while: no one really knew, they just worked. But some months ago this nice paper solved the mystery, recasting the style transfer problem as the problem of aligning the distributions of the features. Matching the gram matrices is just one particular alignment. In the paper they show different methods and compare the results.
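To make point 2 concrete, here is a minimal NumPy sketch. It is not the code from the post: `gram_matrix`, the random stand-in features, the shapes, and the division by the number of spatial positions are all assumptions chosen for illustration. It shows that the gram matrix has a size that depends only on the channel count, and that shuffling the spatial positions leaves it unchanged.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of a (channels, height, width) array.

    Hypothetical helper; dividing by the number of positions is one
    common normalization, assumed here so different image sizes compare.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per channel
    return flat @ flat.T / (h * w)      # shape (c, c)

rng = np.random.default_rng(0)

# Random stand-ins for extracted features: same channel count,
# different spatial sizes (as with style and generated images).
style_feats = rng.standard_normal((8, 32, 48))
generated_feats = rng.standard_normal((8, 20, 20))

g_style = gram_matrix(style_feats)
g_generated = gram_matrix(generated_feats)

# Both gram matrices are (8, 8), so they can be compared directly
# even though the raw feature maps cannot.
style_loss = np.mean((g_generated - g_style) ** 2)

# Shuffle the spatial positions: the "picture" is scrambled,
# but the channel correlations are the same.
perm = rng.permutation(32 * 48)
shuffled = style_feats.reshape(8, -1)[:, perm].reshape(8, 32, 48)
same_gram = np.allclose(gram_matrix(shuffled), g_style)

print(g_style.shape, g_generated.shape, same_gram)
```

The shuffled features are a completely different image as far as a direct feature comparison (the content loss) is concerned, yet the gram-matrix comparison cannot tell them apart. That is the sense in which the gram matrix keeps channel correlations and throws away layout.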