
# Introduction

Back at the Wolfram Summer School 2016 I worked on the project "Image Transformation with Neural Networks: Real-Time Style Transfer and Super-Resolution", which was later published on Wolfram Community. At the time I had to use the MXNetLink package, but now all the needed functionality is built in, so here is a top-level implementation of artistic style transfer in the Wolfram Language. This is a slightly simplified version of the original method, as it uses a single VGG layer to extract the style features, but a full implementation is possible with minor modifications to the code. You can also find this example in the docs:

NetTrain >> Applications >> Computer Vision >> Style Transfer

# Code

Create a new image with the content of a given image and in the style of another given image. This implementation follows the method described in Gatys et al., A Neural Algorithm of Artistic Style. An example content and style image:

To create the image which is a mix of both of these images, start by obtaining a pre-trained image classification network:

vggNet = NetModel["VGG-16 Trained on ImageNet Competition Data"];


Take a subnet that will be used as a feature extractor for the style and content images:

featureNet = Take[vggNet, {1, "relu4_1"}]


There are three loss functions used. The first loss ensures that the "content" is similar in the synthesized image and the content image:

contentLoss = NetGraph[{MeanSquaredLossLayer[]}, {1 -> NetPort["LossContent"]}]


The second loss ensures that the "style" is similar in the synthesized image and the style image. Style similarity is defined as the mean-squared difference between the Gram matrices of the input and target:

gramMatrix = NetGraph[{FlattenLayer[-1], TransposeLayer[1 -> 2], DotLayer[]}, {1 -> 3, 1 -> 2 -> 3}];

styleLoss = NetGraph[{gramMatrix, gramMatrix, MeanSquaredLossLayer[]},
{NetPort["Input"] -> 1, NetPort["Target"] -> 2, {1, 2} -> 3,  3 -> NetPort["LossStyle"]}]
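To see what the styleLoss net computes, here is a minimal pure-Python sketch (an illustration, not part of the Wolfram code): flatten each channel of a feature tensor, take all pairwise dot products between channel vectors to form the Gram matrix, and compare the two Gram matrices with a mean squared error.

```python
# Illustrative sketch of the Gram-matrix style loss (plain Python);
# the actual computation above is done by the Wolfram NetGraph layers.

def gram_matrix(features):
    # features: list of channels, each a height x width nested list
    flat = [[v for row in ch for v in row] for ch in features]  # flatten spatial dims
    # all pairwise dot products between channel vectors -> channels x channels
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in flat] for fi in flat]

def style_loss(feat_input, feat_target):
    # mean squared difference between the two Gram matrices
    gi, gt = gram_matrix(feat_input), gram_matrix(feat_target)
    n = len(gi) ** 2
    return sum((x - y) ** 2 for ri, rt in zip(gi, gt) for x, y in zip(ri, rt)) / n
```

For a feature tensor with c channels the Gram matrix is c x c regardless of the spatial size, which is why the style and content images need not have the same dimensions.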


The third loss ensures that the magnitude of intensity changes across adjacent pixels in the synthesized image is small. This helps the synthesized image look more natural:

l2Loss = NetGraph[{ThreadingLayer[(#1 - #2)^2 &], SummationLayer[]}, {{NetPort["Input"], NetPort["Target"]} -> 1 -> 2}];

tvLoss = NetGraph[<|
"dx1" -> PaddingLayer[{{0, 0}, {1, 0}, {0, 0}}, "Padding" -> "Fixed"],
"dx2" -> PaddingLayer[{{0, 0}, {0, 1}, {0, 0}}, "Padding" -> "Fixed"],
"dy1" -> PaddingLayer[{{0, 0}, {0, 0}, {1, 0}}, "Padding" -> "Fixed"],
"dy2" -> PaddingLayer[{{0, 0}, {0, 0}, {0, 1}}, "Padding" -> "Fixed"],
"lossx" -> l2Loss, "lossy" -> l2Loss, "tot" -> TotalLayer[]|>,
{{"dx1", "dx2"} -> "lossx", {"dy1", "dy2"} -> "lossy",
{"lossx", "lossy"} -> "tot" -> NetPort["LossTV"]}]
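As a plain-Python sketch (again, an illustration rather than the Wolfram code), the total variation loss sums the squared differences between horizontally and vertically adjacent pixels in each channel:

```python
# Illustrative total-variation loss: it penalizes intensity jumps between
# neighboring pixels, encouraging a smoother, more natural-looking image.

def tv_loss(img):
    # img: list of channels, each a height x width nested list of intensities
    loss = 0.0
    for ch in img:
        h, w = len(ch), len(ch[0])
        for y in range(h):
            for x in range(w):
                if x + 1 < w:  # horizontal neighbor
                    loss += (ch[y][x + 1] - ch[y][x]) ** 2
                if y + 1 < h:  # vertical neighbor
                    loss += (ch[y + 1][x] - ch[y][x]) ** 2
    return loss
```

The NetGraph version achieves the same effect by comparing pairs of one-pixel-shifted, border-padded copies of the image; since "Fixed" padding duplicates the border pixels, the border terms contribute zero to the loss.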


Define a function that creates the final training net for any content and style image. This function also creates a random initial image:

createTransferNet[net_, content_Image, styleFeatSize_] := Module[{dims = Prepend[3]@Reverse@ImageDimensions[content]},
NetGraph[<|
"Image" -> ConstantArrayLayer["Array" -> RandomReal[{-0.1, 0.1}, dims]],
"imageFeat" -> NetReplacePart[net, "Input" -> dims],
"content" -> contentLoss,
"style" -> styleLoss,
"tv" -> tvLoss|>,
{"Image" -> "imageFeat",
{"imageFeat", NetPort["ContentFeature"]} -> "content",
{"imageFeat", NetPort["StyleFeature"]} -> "style",
"Image" -> "tv"},
"StyleFeature" -> styleFeatSize   ] ]


Define a NetDecoder for visualizing the predicted image:

meanIm = NetExtract[featureNet, "Input"][["MeanImage"]]


{0.48502, 0.457957, 0.407604}

decoder = NetDecoder[{"Image", "MeanImage" -> meanIm}]


The training data consists of features extracted from the content and style images. Define a feature extraction function:

extractFeatures[img_] := NetReplacePart[featureNet,
"Input" -> NetEncoder[{"Image", ImageDimensions[img], "MeanImage" -> meanIm}]][img];


Create a training set consisting of a single example of a content and style feature:

trainingdata = <|
"ContentFeature" -> {extractFeatures[contentImg]},
"StyleFeature" -> {extractFeatures[styleImg]}
|>


Create the training net whose input dimensions correspond to the content and style image dimensions:

net = createTransferNet[featureNet, contentImg,
Dimensions@First@trainingdata["StyleFeature"]];


When training, the three losses are weighted differently to set the relative importance of the content and style. These values might need to be changed with different content and style images. Create a loss specification that defines the final loss as a combination of the three losses:

perPixel = 1/(3*Apply[Times, ImageDimensions[contentImg]]);
lossSpec = {"LossContent" -> Scaled[6.*10^-5],
"LossStyle" -> Scaled[0.5*10^-14],
"LossTV" -> Scaled[20.*perPixel]};
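Conceptually, the Scaled factors just form a weighted sum of the three losses. A toy Python sketch of the combined objective (the weights mirror the Scaled values above; the 512 x 512 image size is an assumption for illustration, standing in for the content image's actual dimensions):

```python
# Sketch of how the loss specification combines the three losses into the
# final training objective. The weights follow the Scaled[...] values above;
# the 512 x 512 size is a hypothetical stand-in for ImageDimensions[contentImg].

width, height = 512, 512
per_pixel = 1.0 / (3 * width * height)  # same role as perPixel above

w_content = 6.0e-5
w_style = 0.5e-14
w_tv = 20.0 * per_pixel

def combined_loss(loss_content, loss_style, loss_tv):
    return w_content * loss_content + w_style * loss_style + w_tv * loss_tv
```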


Optimize the image using NetTrain. LearningRateMultipliers are used to freeze all parameters in the net except for the ConstantArrayLayer. The training is best done on a GPU, as it will take up to an hour to get good results with CPU training. The training can be stopped at any time via Evaluation -> Abort Evaluation:

trainedNet = NetTrain[net,
trainingdata, lossSpec,
LearningRateMultipliers -> {"Image" -> 1, _ -> None},
TrainingProgressReporting ->
Function[decoder[#Weights[{"Image", "Array"}]]],
MaxTrainingRounds -> 300, BatchSize -> 1,
Method -> {"ADAM", "InitialLearningRate" -> 0.05},
TargetDevice -> "GPU"
]


Extract the final image from the ConstantArrayLayer of the trained net:

decoder[NetExtract[trainedNet, {"Image", "Array"}]]


2 months ago
34 Replies
Andrew Dabrowski: That's spectacular. Do you have an online gallery of examples?
2 months ago
 No, but you can find plenty of examples online. Just look for "neural style transfer".
2 months ago
Applause! That's art done in an interesting way. Can others try it? Do you have a Package or CDF on the topic of "morphing"?
2 months ago
No, but you can easily run this yourself! You can also try to implement the original algorithm ( https://arxiv.org/pdf/1508.06576.pdf ), which extracts style features from multiple VGG layers (while I've used only one layer in the example above).
2 months ago
 I'm trying to follow your instructions, but that vgg file is massive, 4GB. Is that why you truncated it at relu4_1? The extracted featureNet is only 87MB.
2 months ago
Weird. The full VGG-16 should be around 400-500 MB; 4 GB is way too much. Are you sure about this?

The reason why it's truncated to relu4_1 is not related to file size; it's because you want to extract features from a convolutional layer. Features from the LinearLayers at the end of the net are very specialized for classification (which is what VGGs were made for) and have discarded the information about style and content that we want to exploit.

Anyway, most of the parameters (and therefore the file size) of VGG-16 lie in the last few LinearLayers, so yes, by taking them out you get a much lighter model.
2 months ago
 OK, 4GB is what it comes to when I saved it locally so I wouldn't have to download it every time. I guess Save["file.m",vggNet] isn't the way to go...
2 months ago
Well, definitely not! The proper thing to do is to export to a .wlnet file, the specific file format for Wolfram Language neural nets. Anyway, NetModel[] should cache the downloaded file somewhere on your local system, so that you don't have to download it every time. Just try calling NetModel again; you should get the net much faster than on the first evaluation.
2 months ago
But you also cut out a number of convolution layers; they go up to conv5_3.

Another question: why did you run the outputs of extractFeatures through gramMatrix before computing the losses? Did you try it without gramMatrix and get poor results? I can imagine that the internal linear relationships in the feature matrix may be more important than the precise values, but it's not so obvious that's the case. What led you to that?

Later: Oh, I get it: the pics are not of the same size, so this is the only comparison possible.
2 months ago
"But you also cut out a number of convolution layers, they go up to conv5_3."

Yes, that's because deeper convolutional layers start to behave like the final linear layers: too much information is discarded. Just try chopping the net at one of those; you will observe very poor quality in the final result. On the other hand, with very shallow layers you only capture very basic features (like color) of the original style. For a single-layer simplified version like this, mid-deep layers are just the sweet spot.

"Another question: why did you run the outputs of extractFeatures through gramMatrix before computing the losses? Did you try it without gramMatrix and get poor results? [...] What led you to that?"

Just to clarify: this algorithm was not invented by me! It was published in this paper, which started an interesting line of research on these methods. About your questions:

1 - If you look at the content loss, matching the features directly instead of their Gram matrices will match content. So removing the Gram matrix from play will result in the algorithm trying to match the content of both targets, i.e. you will effectively use two content losses and no style loss at all.

2 - As you also noticed, the features themselves contain a lot of spatial information, i.e. the look of the original picture at particular pixel locations. If we want to capture the general style, we are not interested in that. The Gram matrix is an effective way to disregard spatial information and only keep the correlation between channels. As you observe, this also allows using content and style images of different sizes, but that's just a nice collateral effect.

3 - Beyond the observation of spatial information being discarded, the exact reason why Gram matrices can effectively encode the style information was a mystery for a while - no one really knew, they just worked. But some months ago this nice paper solved the mystery, recasting the style transfer problem as the problem of aligning the distributions of the features. Matching the Gram matrices is just a particular alignment. In the paper they show different methods and compare the results.
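To make the "Gram matrices discard spatial information" point concrete, here is a small Python check (an editorial sketch, not from the thread): shuffling the spatial positions of a feature map, with the same shuffle applied in every channel, leaves the Gram matrix unchanged.

```python
# The Gram matrix only records channel-channel dot products, so any
# rearrangement of spatial positions (applied identically across channels)
# produces exactly the same matrix.

def gram(flat):
    # flat: channels x positions matrix as nested lists
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in flat] for fi in flat]

# Two channels observed at four spatial positions.
feats = [[1.0, 2.0, 3.0, 4.0],
         [4.0, 3.0, 2.0, 1.0]]

# Permute the spatial positions (same permutation in every channel).
perm = [2, 0, 3, 1]
shuffled = [[ch[i] for i in perm] for ch in feats]

same = gram(feats) == gram(shuffled)  # True: the spatial layout is discarded
```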
2 months ago
Thanks for that, I think I'm getting it. It seems that by using the Gram matrix we're effectively defining "style" as the pattern of contrasts and similarities in an image independent of content. It seems to work quite well for Impressionism.
2 months ago
I got everything to run on my computer, with the slight exception that when I try to use the GPU I get the following error:

An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 6 entries:
[bt] (0) /opt/Wolfram/Mathematica/11.1/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/libmxnet_wri.so(_ZN4dmlc15LogMessageFatalD1Ev+0x26) [0x7fd784130b76]
[bt] (1) /opt/Wolfram/Mathematica/11.1/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/libmxnet_wri.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x3bf) [0x7fd784159bef]
[bt] (2) /opt/Wolfram/Mathematica/11.1/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/libmxnet_wri.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x50) [0x7fd78415c730]
[bt] (3) /lib64/libstdc++.so.6(+0xbb970) [0x7fd7dae68970]
[bt] (4) /lib64/libpthread.so.0(+0x72e7) [0x7fd7db5482e7]
[bt] (5) /lib64/libc.so.6(clone+0x3f) [0x7fd7da5cb54f]

I have successfully run other nets on the GPU. Maybe ADAM is the problem?
2 months ago
Is the Mathematica session crashing? If not, can you evaluate Internal`$LastInternalFailure right after the error and post the result?

2 months ago

No, the session didn't crash entirely, just the kernel. Do you think it might help to restart the session? Internal`$LastInternalFailure evaluates only to itself, presumably because the kernel was just restarted. Btw, I'm on Arch Linux 64-bit.
2 months ago
 Actually what I posted earlier was not the full error message. The culprit seems to be lack of memory. My GPU is the GeForce GTX 750, I guess that doesn't have enough oomph. The GPU works OK with smaller images.
2 months ago
Andrew Dabrowski: OK, here's my first successful attempt. Thank you Matteo for this wonderful toy!
2 months ago
 What was your hardware set up for this post?
2 months ago
i7-4790 CPU, 4 cores, 16 GB RAM.

The van Gogh penguin was produced without using the GPU; my GeForce GTX 750 isn't good enough. Running time with the CPU was probably under an hour.
2 months ago
Trying myself on a Windows 10 i7 Lenovo 260, but I get a couple of errors:
2 months ago
Memory might be an issue; how much RAM do you have? You might start off with small images so that's not a factor.
2 months ago
 16 GB shouldn't be a problem
2 months ago
 Yeah. But apparently it succeeded in computing the StyleFeature but not the ContentFeature. Is your contentImg larger than the styleImg? What are the ImageDimensions?
2 months ago
Thanks, I shrunk the resolution; it seems to run now! :)
2 months ago
 Cool, it even seems to have captured the disintegrating paint. Ulla makes a fine cherub.
2 months ago
Hi Matteo, interesting post. I tried to run this code, but I'm having a problem with NetModel. I use Mathematica 11.1.1 on the Windows 10 operating system. I get the following error from NetModel:

Import::wlcorr: File is corrupt or is not a WLNet file.

Any suggestion to solve this problem? Thanks.
2 months ago
I think I have downloaded the WLNet file, but it's not working with Import.
2 months ago
Looks like your file got corrupted during download for some reason. Try running ResourceRemove[ResourceObject["VGG-16 Trained on ImageNet Competition Data"]] to clear the file from your system, then NetModel["VGG-16 Trained on ImageNet Competition Data"] to download it again.
2 months ago
Dear Matteo, that's great work! What is the difference between your code (your method) and other applications such as the Prisma app (https://itunes.apple.com/us/app/prisma-photo-editor-art-filters-pic-effects/id1122649984?mt=8) for image transformation? Regards,
2 months ago
Well, no one knows exactly, as Prisma is not open source (to my knowledge). I also don't have direct experience of using Prisma, but I guess it leverages the fast, feedforward-based methods. Those approaches are about 100 times faster (or so) than this implementation (which is an optimization-based algorithm), although they generally provide lower-quality results.

A lot of research has been done on these methods recently, so the algorithms are continuously evolving. If you are interested, you can check this very nice review of the current situation: https://arxiv.org/pdf/1705.04058.pdf
2 months ago
I have tried this example with this code, but I get a bad result. You can find more examples here.
2 months ago
Those examples are produced by a different, more complex algorithm, which is tailored to produce photorealistic transfers. The algorithm I've presented lacks this feature and is intended to produce "artistic" results instead of real-looking images. That's the original, first step in the world of neural style transfer; the code you linked is one of the many applications that build on top of it.
2 months ago
Dear Matteo, we have an image of a daily air temperature time series for the duration 2010-2016 (for example). Can we find the maximum or minimum values from this image using Mathematica? Regards,