

Transfer an artistic style to an image

Posted 4 years ago
35 Replies
31 Total Likes

enter image description here


Back in Wolfram Summer School 2016 I worked on the project "Image Transformation with Neural Networks: Real-Time Style Transfer and Super-Resolution", which was later published on Wolfram Community. At the time I had to use the MXNetLink package, but now all the needed functionality is built-in, so here is a top-level implementation of artistic style transfer with Wolfram Language. This is a slightly simplified version of the original method, as it uses a single VGG layer to extract the style features, but a full implementation is of course possible with minor modifications to the code. You can also find this example in the docs:

NetTrain >> Applications >> Computer Vision >> Style Transfer


Create a new image with the content of a given image and in the style of another given image. This implementation follows the method described in Gatys et al., A Neural Algorithm of Artistic Style. An example content and style image:

enter image description here

To create the image which is a mix of both of these images, start by obtaining a pre-trained image classification network:

vggNet = NetModel["VGG-16 Trained on ImageNet Competition Data"];

Take a subnet that will be used as a feature extractor for the style and content images:

featureNet = Take[vggNet, {1, "relu4_1"}]

enter image description here

There are three loss functions used. The first loss ensures that the "content" is similar in the synthesized image and the content image:

contentLoss = NetGraph[{MeanSquaredLossLayer[]}, {1 -> NetPort["LossContent"]}]

enter image description here

The second loss ensures that the "style" is similar in the synthesized image and the style image. Style similarity is defined as the mean-squared difference between the Gram matrices of the input and target:

gramMatrix = NetGraph[{FlattenLayer[-1], TransposeLayer[1 -> 2],   DotLayer[]}, {1 -> 3, 1 -> 2 -> 3}];

styleLoss = NetGraph[{gramMatrix, gramMatrix, MeanSquaredLossLayer[]},
{NetPort["Input"] -> 1, NetPort["Target"] -> 2, {1, 2} -> 3,  3 -> NetPort["LossStyle"]}]

enter image description here
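To see what gramMatrix computes, here is a minimal sketch on plain arrays instead of net layers. The helper name gram and the array sizes are invented for illustration, and the sketch assumes features are laid out as {channels, height, width}:

```wolfram
(* Hypothetical helper: Gram matrix of a {channels, height, width} feature array *)
gram[features_] := With[
  {f = Flatten[features, {{1}, {2, 3}}]},  (* reshape to channels x (height*width) *)
  f . Transpose[f]                         (* channels x channels inner products *)
]

SeedRandom[1];
small = RandomReal[1, {4, 5, 6}];    (* 4 channels on a 5 x 6 grid *)
large = RandomReal[1, {4, 20, 30}];  (* 4 channels on a 20 x 30 grid *)

{Dimensions[gram[small]], Dimensions[gram[large]]}
(* {{4, 4}, {4, 4}} *)
```

The result depends only on the number of channels, so two Gram matrices can be compared with a mean-squared loss whatever the spatial size of the inputs. Like the gramMatrix net above, this sketch does not divide by the number of spatial positions.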

The third loss ensures that the magnitude of intensity changes across adjacent pixels in the synthesized image is small. This helps the synthesized image look more natural:

l2Loss = NetGraph[{ThreadingLayer[(#1 - #2)^2 &], SummationLayer[]}, {{NetPort["Input"], NetPort["Target"]} -> 1 -> 2}];

tvLoss = NetGraph[<|
   "dx1" -> PaddingLayer[{{0, 0}, {1, 0}, {0, 0}}, "Padding" -> "Fixed" ],
   "dx2" ->  PaddingLayer[{{0, 0}, {0, 1}, {0, 0}}, "Padding" -> "Fixed"],
   "dy1" ->  PaddingLayer[{{0, 0}, {0, 0}, {1, 0}}, "Padding" -> "Fixed" ],
   "dy2" ->  PaddingLayer[{{0, 0}, {0, 0}, {0, 1}}, "Padding" -> "Fixed"],
   "lossx" -> l2Loss, "lossy" -> l2Loss, "tot" -> TotalLayer[]|>,
 {{"dx1", "dx2"} -> "lossx", {"dy1", "dy2"} -> "lossy",
   {"lossx", "lossy"} -> "tot" -> NetPort["LossTV"]}]

enter image description here
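As I read the padding trick in tvLoss, subtracting the two shifted copies leaves the differences between neighboring pixels, with zeros at the borders. A rough plain-array equivalent, using a hypothetical helper totalVariation on a {channels, height, width} array, could look like this:

```wolfram
(* Hypothetical helper: sum of squared differences between adjacent pixels *)
totalVariation[x_] :=
  Total[(x[[All, 2 ;;, All]] - x[[All, ;; -2, All]])^2, Infinity] +  (* vertical neighbors *)
  Total[(x[[All, All, 2 ;;]] - x[[All, All, ;; -2]])^2, Infinity]    (* horizontal neighbors *)

flat = ConstantArray[0.5, {3, 4, 4}];  (* perfectly smooth image *)
tiny = {{{0, 1}, {2, 3}}};             (* one channel, 2 x 2 pixels *)

{totalVariation[flat], totalVariation[tiny]}
(* {0., 10} *)
```

For the 2 x 2 example the vertical differences contribute 4 + 4 and the horizontal ones 1 + 1, which gives 10. A smooth image scores zero, so minimizing this term pushes the synthesized image away from pixel-level noise.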

Define a function that creates the final training net for any content and style image. This function also creates a random initial image:

createTransferNet[net_, content_Image, styleFeatSize_] := Module[{dims = Prepend[3]@Reverse@ImageDimensions[content]},
 NetGraph[<|
   "Image" -> ConstantArrayLayer["Array" -> RandomReal[{-0.1, 0.1}, dims]],
   "imageFeat" -> NetReplacePart[net, "Input" -> dims],
   "content" -> contentLoss,
   "style" -> styleLoss,
   "tv" -> tvLoss|>,
  {"Image" -> "imageFeat",
   {"imageFeat", NetPort["ContentFeature"]} -> "content",
   {"imageFeat", NetPort["StyleFeature"]} -> "style",
   "Image" -> "tv"},
  "StyleFeature" -> styleFeatSize]]

Define a NetDecoder for visualizing the predicted image:

meanIm = NetExtract[featureNet, "Input"][["MeanImage"]]

{0.48502, 0.457957, 0.407604}

decoder = NetDecoder[{"Image", "MeanImage" -> meanIm}]

enter image description here

The training data consists of features extracted from the content and style images. Define a feature extraction function:

extractFeatures[img_] := NetReplacePart[featureNet, "Input" ->NetEncoder[{"Image", ImageDimensions[img], 
 "MeanImage" ->meanIm}]][img];

Create a training set consisting of a single example of a content and style feature:

trainingdata = <|
   "ContentFeature" -> {extractFeatures[contentImg]},
   "StyleFeature" -> {extractFeatures[styleImg]}
   |>;

Create the training net whose input dimensions correspond to the content and style image dimensions:

net = createTransferNet[featureNet, contentImg, Dimensions[First[trainingdata["StyleFeature"]]]];
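Inside createTransferNet, the input size of the net comes from Prepend[3]@Reverse@ImageDimensions[content]. A small sketch of that conversion, with a made-up image size standing in for a real content image:

```wolfram
(* ImageDimensions returns {width, height}; the net expects {channels, height, width} *)
imageSize = ImageDimensions[ConstantImage[0, {640, 480}]];  (* hypothetical 640 x 480 image *)
dims = Prepend[3]@Reverse@imageSize
(* {3, 480, 640} *)
```

The leading 3 is the number of RGB channels, so the ConstantArrayLayer holds one trainable value per color channel and pixel.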

When training, the three losses are weighted differently to set the relative importance of the content and style. These values might need to be changed with different content and style images. Create a loss specification that defines the final loss as a combination of the three losses:

perPixel = 1/(3*Apply[Times, ImageDimensions[contentImg]]);
lossSpec = {"LossContent" -> Scaled[6.*10^-5], 
   "LossStyle" -> Scaled[0.5*10^-14], 
   "LossTV" -> Scaled[20.*perPixel]};

Optimize the image using NetTrain. LearningRateMultipliers are used to freeze all parameters in the net except for the ConstantArrayLayer. The training is best done on a GPU, as it will take up to an hour to get good results with CPU training. The training can be stopped at any time via Evaluation -> Abort Evaluation:

trainedNet = NetTrain[net,
  trainingdata, lossSpec,
  LearningRateMultipliers -> {"Image" -> 1, _ -> None},
  TrainingProgressReporting -> 
   Function[decoder[#Weights[{"Image", "Array"}]]],
  MaxTrainingRounds -> 300, BatchSize -> 1,
  Method -> {"ADAM", "InitialLearningRate" -> 0.05},
  TargetDevice -> "GPU"]

enter image description here

Extract the final image from the ConstantArrayLayer of the trained net:

decoder[NetExtract[trainedNet, {"Image", "Array"}]]

enter image description here


That's spectacular. Do you have an online gallery of examples?

No, but you can find plenty of examples online. Just look for "neural style transfer".

Posted 4 years ago

I'm trying to follow your instructions, but that vgg file is massive, 4GB. Is that why you truncated it at relu4_1? The extracted featureNet is only 87MB.

Weird. The full VGG16 should be around 400-500MB, 4GB is way too much. Are you sure about this?

The reason why it's truncated to relu4_1 is not related to file size; it's because you want to extract features from a convolutional layer. Features from the LinearLayers at the end of the net are very specialized for classification (which is what VGGs were made for) and have discarded the information about style and content that we want to exploit.

Anyway, most of the parameters (and therefore the file size) of the VGG16 lie in the last few LinearLayers. So yes, by taking them out you get a much lighter model.
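A back-of-the-envelope check of that claim, as a sketch: the layer shapes below are typed in by hand from the standard VGG-16 configuration (3x3 convolutions, 224x224 input), not read from the NetModel object, and the variable names are invented here.

```wolfram
(* {input channels, output channels} of the thirteen 3x3 convolutions *)
convShapes = {{3, 64}, {64, 64}, {64, 128}, {128, 128}, {128, 256},
   {256, 256}, {256, 256}, {256, 512}, {512, 512}, {512, 512},
   {512, 512}, {512, 512}, {512, 512}};
(* {inputs, outputs} of the three linear layers *)
linearShapes = {{512*7*7, 4096}, {4096, 4096}, {4096, 1000}};

convParams = Total[(9 #1 #2 + #2 &) @@@ convShapes];    (* weights + biases *)
linearParams = Total[(#1 #2 + #2 &) @@@ linearShapes];  (* weights + biases *)
linearShare = N[linearParams/(convParams + linearParams)];

{convParams, linearParams, linearShare}
(* {14714688, 123642856, 0.893647} *)
```

Under these assumptions roughly 89% of the parameters sit in the three linear layers, which is consistent with the truncated featureNet being so much smaller than the full model.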

OK, 4GB is what it comes to when I saved it locally so I wouldn't have to download it every time. I guess Save["file.m",vggNet] isn't the way to go...

Well, definitely not! The proper thing to do is export to a .wlnet file, the specific file format for Wolfram Language neural nets.
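A minimal sketch of the .wlnet round trip, using a tiny made-up net and a temporary file so that nothing large is written; the same Export and Import calls should apply to vggNet:

```wolfram
(* Hypothetical file name in the temporary directory *)
file = FileNameJoin[{$TemporaryDirectory, "tiny.wlnet"}];

(* Tiny initialized net used as a stand-in for vggNet *)
tinyNet = NetInitialize@NetChain[{LinearLayer[2]}, "Input" -> 3];

Export[file, tinyNet];   (* the format is inferred from the .wlnet extension *)
reloaded = Import[file];

(* Compare predictions of the original and the reloaded net *)
reloaded[{1., 2., 3.}] == tinyNet[{1., 2., 3.}]
```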

Anyway, NetModel[] should cache the downloaded file to your local system somewhere, so that you don't have to download it every time. Just try calling the NetModel again, you should get the net much faster than the first evaluation.

But you also cut out a number of convolution layers, they go up to conv5_3.

Another question: why did you run the outputs of extractFeatures through gramMatrix before computing the losses? Did you try it without gramMatrix and get poor results?

I can imagine that the internal linear relationships in the feature matrix may be more important than the precise values, but it's not so obvious that's the case. What led you to that?

Later: Oh, I get it: the pics are not of the same size so this is the only comparison possible.

But you also cut out a number of convolution layers, they go up to conv5_3.

Yes, that's because deeper convolutional layers start to behave like the final linear layers: too much information is discarded. Just try chopping the net at one of those and you will observe very poor quality in the final result. On the other hand, with very shallow layers you only capture very basic features (like color) of the original style. For a single-layer simplified version like this, mid-deep layers are just the sweet spot.

Another question: why did you run the outputs of extractFeatures through gramMatrix before computing the losses? Did you try it without gramMatrix and get poor results?

I can imagine that the internal linear relationships in the feature matrix may be more important than the precise values, but it's not so obvious that's the case. What led you to that?

Later: Oh, I get it: the pics are not of the same size so this is the only comparison possible.

Just to clarify: this algorithm was not invented by me! It was published in this paper, which started an interesting line of research on these methods. About your questions:

1 - If you look at the content loss, matching the features directly instead of their gram matrices will match content. So removing the gram matrix from play will result in the algorithm trying to match the content of both targets, i.e. you will effectively use two content losses and no style loss at all.

2 - As you also noticed, the features themselves contain a lot of spatial information, i.e. the look of the original picture at particular pixel values. If we want to capture the general style, we are not interested in that. The gram matrix is an effective way to disregard spatial information and only keep the correlation between channels. As you observe, this also allows using content and style images of different sizes, but that's just a nice side effect.

3 - Beyond the observation of spatial information being discarded, the exact reason why gram matrices can effectively encode the style information has been a mystery for a while - no one really knew, they just worked. But some months ago this nice paper solved the mystery, recasting the style transfer problem as the problem of aligning the distributions of the features. Matching the gram matrices is just a particular alignment. In the paper they show different methods and compare the results.
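Point 2 can be checked numerically. In this sketch (random numbers, invented variable names) the columns of a flattened feature matrix are shuffled, which scrambles where each feature vector sits in the image, and the Gram matrix does not change:

```wolfram
SeedRandom[7];
feat = RandomReal[1, {3, 12}];                    (* 3 channels x 12 spatial positions *)
shuffled = feat[[All, RandomSample[Range[12]]]];  (* same feature vectors, scrambled positions *)

gramOriginal = feat . Transpose[feat];
gramShuffled = shuffled . Transpose[shuffled];

Max[Abs[gramOriginal - gramShuffled]] < 10^-10
(* True *)
```

Each Gram entry is a sum over all positions, so the order of the positions cannot matter; only the channel correlations survive.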

Thanks for that, I think I'm getting it. It seems that by using the gram matrix we're effectively defining "style" as the pattern of contrasts and similarities in an image independent of content. It seems to work quite well for Impressionism.

I got everything to run on my computer, with the slight exception that when I try to use the GPU I get the following error.

An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 6 entries:
[bt] (0) /opt/Wolfram/Mathematica/11.1/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/ [0x7fd784130b76]
[bt] (1) /opt/Wolfram/Mathematica/11.1/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/ [0x7fd784159bef]
[bt] (2) /opt/Wolfram/Mathematica/11.1/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/ [0x7fd78415c730]
[bt] (3) /lib64/ [0x7fd7dae68970]
[bt] (4) /lib64/ [0x7fd7db5482e7]
[bt] (5) /lib64/ [0x7fd7da5cb54f]

I have successfully run other nets on the GPU. Maybe ADAM is the problem?

Is the Mathematica session crashing? If not, can you evaluate Internal`$LastInternalFailure right after the error and post the result?

No, the session didn't crash entirely, just the kernel. Do you think it might help to restart the session?

Internal`$LastInternalFailure evaluates only to itself, presumably because the kernel was just restarted.

Btw, I'm on Arch Linux 64bit.

Actually what I posted earlier was not the full error message. The culprit seems to be lack of memory. My GPU is the GeForce GTX 750, I guess that doesn't have enough oomph. The GPU works OK with smaller images.

You can try to implement the original algorithm, which extracts style features from multiple VGG layers (I've used only one layer in the example above).
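One possible starting point for a multi-layer version, sketched on a toy net so that it runs without downloading VGG. The layer names and sizes are invented for this sketch; with the real model one would apply Take to vggNet at several layer names instead.

```wolfram
(* Toy stand-in for VGG; layer names and sizes are made up *)
toyNet = NetInitialize@NetChain[<|
     "conv1" -> ConvolutionLayer[4, 3], "relu1" -> Ramp,
     "conv2" -> ConvolutionLayer[8, 3], "relu2" -> Ramp|>,
    "Input" -> {3, 16, 16}];

(* One truncated feature extractor per style layer *)
styleNets = AssociationMap[Take[toyNet, {1, #}] &, {"relu1", "relu2"}];

img = RandomReal[1, {3, 16, 16}];
styleDims = Dimensions[#[img]] & /@ styleNets
(* <|"relu1" -> {4, 14, 14}, "relu2" -> {8, 12, 12}|> *)
```

Each extractor would then feed its own copy of styleLoss with its own weight in the loss specification. That wiring is not shown here.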

OK, here's my first successful attempt.

enter image description here


enter image description here


enter image description here

Thank you Matteo for this wonderful toy!

What was your hardware set up for this post?

i7-4790 CPU, 4 cores, 16GB ram

The van Gogh Penguin was produced without using the GPU, my GeForce GTX 750 isn't good enough. Running time with the CPU was probably under an hour.

Trying this myself on a Windows 10 i7 Lenovo 260, but I get a couple of errors: enter image description here

Memory might be an issue, how much RAM do you have?

You might start off with small images so that's not an issue.
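A minimal sketch of shrinking an image before extracting features. The target width of 256 is an arbitrary choice, and a random image stands in for contentImg:

```wolfram
big = RandomImage[1, {1200, 900}];  (* stand-in for a large content image *)
small = ImageResize[big, 256];      (* new width 256, aspect ratio preserved *)

ImageDimensions[small]
(* {256, 192} *)
```

Feature arrays scale with the number of pixels, so halving both sides cuts the memory needed for the features to roughly a quarter.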

16 GB shouldn't be a problem

Yeah. But apparently it succeeded in computing the StyleFeature but not the ContentFeature. Is your contentImg larger than the styleImg? What are the ImageDimensions?

Thanks, I shrunk resolution, it seems to run now ! :)

Great, please post some images.

enter image description here


Cool, it even seems to have captured the disintegrating paint. Ulla makes a fine cherub.


Most of the style transfer images I have seen so far end up looking very similar - what I would describe as "Van Gogh" style.

This image is very different - it looks as if you managed to capture the very different style of Raphael. Can you advise how you tweaked the settings of the algorithm?


Posted 4 years ago

Hi Matteo,

Interesting post. I tried to run this code, but ran into a problem with NetModel. I use Mathematica 11.1.1 on the Windows 10 operating system.

I get the NetModel error below. Import::wlcorr: File is corrupt or is not a WLNet file.

Any suggestion to solve this problem? Thanks.

enter image description here

Posted 4 years ago

I think I have downloaded the WLNet file, but it's not working with Import.

enter image description here

Looks like your file got corrupted during download for some reason. Try to run ResourceRemove[ResourceObject["VGG-16 Trained on ImageNet Competition Data"]] to clear the file from your system and NetModel["VGG-16 Trained on ImageNet Competition Data"] to download it again.

Dear Matteo,

That's great work!

What is the difference between your code (your method) and other applications, such as the Prisma app, for image transformation?


Well, no one knows exactly, as Prisma is not open source (to my knowledge). I also don't have direct experience of using Prisma, but I guess it leverages the fast, feedforward-based methods. Those approaches are about 100 times faster (or so) than this implementation (which is an optimization-based algorithm), although they generally provide lower-quality results.

A lot of research has been done on these methods recently, so the algorithms are continuously evolving.

If you are interested, you can check this very nice review about the current situation:

Posted 4 years ago

I have tried this example by this code

Mathematica graphics

But I get a bad result. You can find more examples here.

Those examples are produced by a different, more complex algorithm, which is tailored to produce photorealistic transfers.

The algorithm I've presented lacks this feature, and is intended to produce "artistic" results instead of real-looking images. That's the original, first step in the world of neural style transfer. The code you linked is one of the many applications which build on top of that.

Dear Matteo,

We have an image of a daily air temperature time series for the period 2010-2016 (for example).

Can we find the maximum or minimum values of this image using Mathematica?


So, is there like a "complete code" or a "workbook" version we can play with? As opposed to attempting to assemble it ourselves from scratch? ^_^

I'd love to play with this, at some point.

And can it only work with neural networks trained on a specific style image, or am I misunderstanding and the neural network you download is doing something else, so you can still specify an arbitrary "style" image as input? If the downloaded neural network isn't trained on a specific "style," what's its actual function? What is it "trained" on? Just wondering. Not super familiar with it...
