

[WSS16] Image Colorization

Posted 5 years ago
15 Replies
32 Total Likes

The aim of my project for the Wolfram Science Summer School was to build a neural network that can colorize grayscale images realistically. The network follows the article [1], whose authors propose a fully automated approach to colorizing grayscale images that combines global image features, extracted from the entire image, with local image features, computed from small image patches. Global priors provide image-level information, such as whether the image was taken indoors or outdoors or whether it is day or night, while local features represent the local texture or object at a given location. Combining both kinds of features makes it possible to leverage semantic information to color images without human interaction. The approach is based on Convolutional Neural Networks, which have a strong capacity for learning; the network is trained to predict the chrominance of a grayscale image using the CIE L*a*b* colorspace. Predicting colors has the nice property that training data is practically free: any color photo can be used as a training example.

Net Layers

The model consists of four main components: a low-level features network, a mid-level features network, a global features network, and a colorization network. First, a common set of shared low-level features is extracted from the image. From these features, a set of global image features and a set of mid-level image features are computed. The mid-level and global features are then fused by a "fusion layer" and used as the input to a colorization network that outputs the final chrominance map. Each layer has a ReLU transfer function except for the last convolution of the colorization network, where a sigmoid function is applied. The model can process images of any size, but it is most efficient when the input images are 224x224 pixels, because the shared low-level features layers can then share outputs. When the input image has a different resolution, the low-level feature weights are still shared, but a copy of the image rescaled to 224x224 must be used for the global features network. This requires processing both the original image and the rescaled image through the low-level features network, increasing both memory consumption and computation time. For this reason, we trained the model exclusively with images of size 224x224 pixels.

Low-Level Features Network

A 6-layer Convolutional Neural Network obtains low-level features directly from the input image. Its convolution filter bank is shared and feeds both the global features network and the mid-level features network. To reduce the size of the feature maps, we use convolution layers with increased strides instead of max-pooling layers (the usual choice for similar kinds of networks). With a stride of 2 and suitable padding, the output of a layer is effectively half the size of its input. We used 3x3 convolution kernels exclusively, with a padding of 1x1, so that the output is the same size as the input (or half the size when using a stride of 2).
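
The effect of the 3x3 kernel, 1x1 padding and stride on the feature-map size can be checked with the standard convolution output-size formula. The following is a minimal Python sketch, not part of the original Mathematica code; the stride pattern 2, 1, 2, 1, 2, 1 is taken from the `lln` chain posted in the replies, and the function names are illustrative.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Side length after a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(size, strides):
    """Return the feature-map side length after each convolution."""
    sizes = []
    for stride in strides:
        size = conv_output_size(size, stride=stride)
        sizes.append(size)
    return sizes


# Strides of the six low-level convolutions (taken from the posted `lln` chain).
LOW_LEVEL_STRIDES = [2, 1, 2, 1, 2, 1]

sizes = trace_sizes(224, LOW_LEVEL_STRIDES)
print(sizes)  # [112, 112, 56, 56, 28, 28]
```

Under these assumptions a 224x224 input ends up as a 28x28 feature map, which matches the 28x28 spatial size of the local features mentioned in the fusion layer description.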

Global Features Network

The global image features are obtained by further processing the low-level features with four convolutional layers followed by three fully-connected layers. This results in a 256-dimensional vector representation of the image.
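
As a rough shape check, the sketch below follows a 28x28x512 low-level feature map through the four convolutions and three fully-connected layers. It is a Python sketch under stated assumptions: the strides (2, 1, 2, 1) and layer widths (1024, 512, 256) are read from the `gln` and `gln2` chains posted in the replies, and the 28x28 starting size assumes a 224x224 input.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Side length after a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


size, channels = 28, 512           # assumed low-level output for a 224x224 input
for stride in [2, 1, 2, 1]:        # four convolutions with 512 filters each
    size = conv_output_size(size, stride=stride)

flattened = size * size * channels  # length of the vector fed to the first dense layer
fc_widths = [1024, 512, 256]        # three fully-connected layers

print(size, flattened, fc_widths[-1])  # 7 25088 256
```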

Mid-Level Features Network

The mid-level features are obtained by processing the low-level features further with two convolutional layers. The output is bottlenecked from the original 512-channel low-level features to 256-channel mid-level features. Unlike the global image features, the low-level and mid-level features networks are fully convolutional networks, such that the output is a scaled version of the input.

Fusion Layer

In order to combine the global image features, a 256-dimensional vector, with the (mid-level) local image features, a 28x28x256-dimensional tensor, the authors introduce a fusion layer. This can be thought of as concatenating the global features with the local features at each spatial location and processing them through a small one-layer network. This effectively combines the global features and the local features to obtain a new feature map that is, like the mid-level features, a 3D volume.
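
The "concatenate at each location, then apply a one-layer network" idea can be sketched on a toy tensor. This is an illustrative pure-Python sketch with made-up sizes (a 2x2 map with 2 local and 2 global channels instead of 28x28x256 and 256); `fuse`, the weights and the bias are hypothetical and not taken from the paper or the Mathematica code.

```python
def fuse(local, global_vec, weights, bias):
    """Fuse local features (H x W x C_local nested lists) with a global vector.

    At every spatial location the global vector is appended to the local
    feature vector, and the result goes through one linear layer + ReLU.
    weights has C_out rows of length C_local + C_global; bias has length C_out.
    """
    fused = []
    for row in local:
        fused_row = []
        for pixel in row:
            joined = list(pixel) + list(global_vec)  # concatenation at this location
            out = [
                max(0.0, sum(w * x for w, x in zip(w_row, joined)) + b)  # ReLU
                for w_row, b in zip(weights, bias)
            ]
            fused_row.append(out)
        fused.append(fused_row)
    return fused


local = [[[1.0, 2.0], [0.0, 1.0]],
         [[3.0, 0.0], [1.0, 1.0]]]
global_vec = [0.5, -1.0]
weights = [[1, 0, 1, 0],   # output 0 = local[0] + global[0]
           [0, 1, 0, 1]]   # output 1 = local[1] + global[1]
bias = [0.0, 0.0]

fused = fuse(local, global_vec, weights, bias)
print(fused[0][0])  # [1.5, 1.0]
```

The output keeps the spatial layout of the local features, so it is again a 3D volume. With these particular toy weights the layer reduces to adding the global vector to each local vector before the ReLU, which is one way to see how an additive broadcast relates to the concatenation described above.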

Colorization Network

Once the features are fused, they are processed by a set of convolutions and upsampling layers; the upsampling layers use the nearest neighbour technique, so that the output is twice as wide and twice as tall. These layers are alternated until the output is half the size of the original input. The output layer of the colorization network is a convolutional layer with a Sigmoid transfer function that outputs the chrominance of the input grayscale image. Finally, the computed chrominance is upsampled and combined with the input intensity/luminance image to produce the resulting color image. To train the network, we used the Mean Squared Error (MSE) criterion. Given a color image for training, the input of the model is the grayscale image, while the target output is the a*b* components of the CIE L*a*b* colorspace. The a*b* components are globally normalized so that they lie in the [0,1] range of the Sigmoid transfer function.
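
Nearest-neighbour upsampling by a factor of 2 simply repeats every value twice along each axis. The sketch below is a minimal pure-Python illustration on a single 2D channel; the function name is hypothetical, and the 28x28 starting size assumes a 224x224 input.

```python
def upsample_nearest(channel, factor=2):
    """Nearest-neighbour upsampling of a 2D list.

    Each value is repeated `factor` times horizontally and each row
    is repeated `factor` times vertically.
    """
    out = []
    for row in channel:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


small = [[1, 2],
         [3, 4]]
big = upsample_nearest(small)
print(big)  # [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]

# Two doublings take a 28x28 fused map to 112x112, half of a 224x224 input.
side = 28
for _ in range(2):
    side *= 2
print(side)  # 112
```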

Colorization with Classification

While training with only color images using the MSE criterion does give good performance, the network sometimes makes obvious mistakes because it has not properly learned the global context of the image, e.g., whether it is indoors or outdoors. As learning these networks is a non-convex problem, we facilitated the optimization by also training for classification jointly with the colorization. Because we trained the model using a large-scale dataset for classification of N classes (Mathematica ImageIdentify dataset), we had classification labels available for training. These labels correspond to a global image tag and thus can be used to guide the training of the global image features. We did this by introducing another very small neural network that consists of two fully-connected layers: a hidden layer with 256 outputs and an output layer with as many outputs as the number of classes in the dataset. The input of this network is the second-to-last layer of the global features network, which has 512 outputs. We trained this network using the cross-entropy loss, jointly with the MSE loss for the colorization network.
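
The joint objective can be written as the colorization MSE plus a small multiple of the classification cross-entropy. The sketch below is a minimal Python illustration for a single training example; the weight 1/300 is the value of `$\[Alpha]` in the code posted in the replies, while the function names and the toy numbers are assumptions made for the example.

```python
import math

ALPHA = 1 / 300  # weight on the classification term (value from the posted code)


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs, class_index):
    """Cross-entropy for a single example given predicted class probabilities."""
    return -math.log(probs[class_index])


def joint_loss(pred_ab, target_ab, class_probs, class_index, alpha=ALPHA):
    """Colorization loss plus the down-weighted classification loss."""
    return mse(pred_ab, target_ab) + alpha * cross_entropy(class_probs, class_index)


# Toy example: two normalized chrominance values and a two-class prediction.
loss = joint_loss([0.5, 0.5], [0.5, 0.7], [0.25, 0.75], class_index=1)
print(loss)
```

Keeping the weight small means the classification term guides the global features without dominating the colorization error.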


The aim of my project was to build the network described in the paper using the new NeuralNetworks framework of Mathematica 11. To achieve this, some adjustments were needed. First of all, we decided to train and evaluate the network only on images of 224x224 pixels, in order to use (and train) only one low-level features network instead of two with shared weights and different outputs. The final network has two inputs: the first is the color 224x224 px image, encoded by the "NetEncoder" function in the LAB colorspace; the second is the class of the image. The two outputs (named "Loss" and "Output") represent the values of the two loss functions used (one for the colorization, the other for the classification), which are then summed together by the NetTrain function. The three color channels of the input image are split by the split layer: the L channel feeds the "low-level features" network, while the a,b channels are scaled and concatenated to obtain a target set for the mean squared loss function that is comparable with the output of the colorization network. The fusion layer has been replaced by a broadcast layer, which joins the rank-3 tensor output by the mid-level network with the vector from the global features network. However, the way they are combined is not exactly the same as the one described in the paper. To evaluate the trained network on a grayscale image, it is necessary to drop some branches of the network, such as the classification network and the layers that process the a,b channels of the color input image to produce the target set for the colorization loss function.

[Figures: Network described in the paper; Network implementation with the Mathematica NeuralNetworks framework]


The network described in the paper was trained on the Places scene dataset [Zhou et al. 2014], which consists of 2,448,872 training images and 20,500 validation images, with 205 classes corresponding to the types of scene. The authors filtered the images with a small automated script, removing grayscale images and those with little color variance. They trained using a batch size of 128 for 200,000 iterations, corresponding to roughly 11 epochs. This takes roughly 3 weeks on one core of an NVIDIA Tesla K80 GPU. We needed to introduce some new layers in the existing framework and to fix some bugs, so we were able to train our network for only 14 hours, on a dataset of 350,000 images, on one core of a Titan GPU machine. Furthermore, the images in our training set mainly represent specific items, so better results could probably be achieved by also including images of different types of subjects (landscapes, human-created images, indoors, etc.). The results we obtained are shown in the section above and are quite good. We are confident that with a longer training on a larger dataset our network would give considerably better results.

Open Problems / Future Developments

Because the global and local features are computed separately, global features from one image can be combined with local features from another to change the style of the resulting colorization. This is one of the more interesting things the model can do, and it is straightforward thanks to the decorrelation between the global features and the mid-level features. To colorize an image A using the style of an image B, compute the mid-level local features of image A and the global features of image B, then fuse these features and process them with the colorization network. Both the local and the global features are computed from grayscale images, so no color information is needed at all.
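
The wiring of this style transfer can be sketched as a small function that takes the four sub-networks as arguments. This is a Python sketch, not the Mathematica implementation: `colorize_with_style` and all four callables are hypothetical, and the toy stand-ins below only exist to make the example runnable.

```python
def colorize_with_style(gray_a, gray_b, local_features, global_features, fuse, colorize):
    """Colorize image A using the global (style) features of image B."""
    local_a = local_features(gray_a)     # mid-level local features of image A
    global_b = global_features(gray_b)   # global features of image B
    return colorize(fuse(local_a, global_b))


# Toy stand-ins (assumptions for illustration only, not real networks):
def toy_local(image):
    return [list(row) for row in image]             # local features = pixel values


def toy_global(image):
    values = [v for row in image for v in row]
    return sum(values) / len(values)                # global feature = mean intensity


def toy_fuse(local, global_value):
    return [[v + global_value for v in row] for row in local]


def toy_colorize(fused):
    return fused                                    # identity "colorization"


result = colorize_with_style([[0.0, 1.0]], [[0.5, 0.5]],
                             toy_local, toy_global, toy_fuse, toy_colorize)
print(result)  # [[0.5, 1.5]]
```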

The main limitation of the method is that it is data-driven and thus can only colorize images that share common properties with those in the training set. To handle significantly different types of images, it would be necessary to train the model on all types of images (indoor, outdoor, human-created...). To obtain good style transfer results, it is important that the two images share some level of semantic similarity.


[1] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. "Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification".

15 Replies

Very impressive, thanks for sharing!

You earned the "Featured Contributor" badge, congratulations!

This is a great post and it has been selected for the curated Staff Picks group. Your profile is now distinguished by a "Featured Contributor" badge and displayed on the "Featured Contributor" board.

Posted 5 years ago

Great work!! Would it be possible for you to share the NetChain of the developed network?

This network will only be runnable in the released version of M11.

Also: we are continuing to train the model (requires at least 2-3 weeks of training on a Titan X GPU for optimal performance, the images above were produced by a 14 hour trained net). If there is interest, I can post the trained net in 2 weeks time.

We are also working on a model gallery for 11.1 where we will post trained models like this one.

Posted 5 years ago

Thank you for your answer! Currently I'm testing the pre-release of Mathematica. My main objective is to see how you have implemented a network that looks so complex in Mathematica, not to run or train it. This is why I only asked for the NetChain. I understand that you probably cannot post it or send it to my email. Even so, thank you for sharing the great work that you have done!

Global Variables

(* Loss function parameter and number of classes of the images *)
$\[Alpha] = 1/300;
$numClasses = 4314;

Net Layers

conv[out_Integer, k_Integer, str_Integer, p_Integer] := ConvolutionLayer[out, k, "Stride" -> str, "PaddingSize" -> p]; (* Convolution layer *)
fc[n_Integer] := DotPlusLayer[n]; (* Fully connected layer *)
relu = ElementwiseLayer[Ramp]; (* Ramp activation function *)
\[Sigma] = ElementwiseLayer[LogisticSigmoid];(* Sigmoid activation function *)
\[Sigma]1 = ElementwiseLayer[LogisticSigmoid];
tl1 = ScalarTimesLayer[100];  (* This layer multiplies elementwise the input tensor by a scalar number *)
tl2 = ScalarTimesLayer[100];
timesLoss = ScalarTimesLayer[$\[Alpha]];
bn = BatchNormalizationLayer[]; (* Batch Normalization layer *)
upSampl = UpsampleLayer[2]; (* Upsampling using the nearest neighbor technique *)
sl = SplitLayer[False];  (* This layer splits the input tensor into its channels *)     
cl = CatenateLayer[]; (* This layer catenates the input tensors and outputs a new tensor *)

(* "Fusion" layer *)
rshL = ReshapeLayer[{256, 1, 1}]; (* This layer reinterprets the input to be an array of the specified dimensions *)
bl = BroadcastPlusLayer[]; (* This layer broadcasts the vector along the spatial dimensions of the tensor and adds it (rather than concatenating it, as in the paper) *)

(* Loss functions *)
lossMS = MeanSquaredLossLayer[]; 
lossCE = CrossEntropyLossLayer["Index"]; 

Net Chains

(* Low-Level Features Network *)
lln = NetChain[{conv[64, 3, 2, 1], bn, relu, conv[128, 3, 1, 1], bn, relu, conv[128, 3, 2, 1], bn, relu, conv[256, 3, 1, 1], bn, relu, 
    conv[256, 3, 2, 1], bn, relu, conv[512, 3, 1, 1], bn, relu} ];
(* Mid-Level Features Network *)
mln = NetChain[{conv[512, 3, 1, 1], bn, relu, conv[256, 3, 1, 1], bn, relu}];
(* Colorization Network *)
coln = NetChain[{conv[256, 3, 1, 1], bn, relu, conv[128, 3, 1, 1], bn, relu, upSampl, conv[64, 3, 1, 1], bn, relu, conv[64, 3, 1, 1], 
    bn, relu, upSampl, conv[32, 3, 1, 1], bn, relu, conv[2, 3, 1, 1], \[Sigma], upSampl}];
(* Global Features Network *)
gln = NetChain[{conv[512, 3, 2, 1], bn, relu, conv[512, 3, 1, 1], bn, relu, conv[512, 3, 2, 1], bn, relu, conv[512, 3, 1, 1], bn, relu, 
    FlattenLayer[], fc[1024], bn, relu, fc[512], bn, relu}];
gln2 = NetChain[{fc[256], bn, relu}];
(* Classification Network *)
classn = NetChain[{fc[256], bn, relu, fc[$numClasses], bn, relu}];

Net Structure

classNet = NetGraph[
  <| "SplitL" -> sl, "LowLev" -> lln, "MidLev" -> mln, "GlobLev" -> gln, "GlobLev2" -> gln2, "ColNet" -> coln, "Sigmoid" -> \[Sigma]1, "TimesL1" -> tl1, "TimesL2" -> tl2, "CatL" -> cl, "LossMS" -> lossMS, "LossCE" -> lossCE, "Broadcast" -> bl, "ReshapeL" -> rshL, "ClassN" -> classn, "timesLoss" -> timesLoss |>,
  { NetPort["Image"] -> "SplitL",  "SplitL" -> {"LowLev", "TimesL1", "TimesL2"}, {"TimesL1", "TimesL2"} -> "CatL", "CatL" -> "Sigmoid", "LowLev" -> "MidLev", "LowLev" -> "GlobLev", "GlobLev" -> "GlobLev2", "GlobLev" -> "ClassN", "MidLev" -> NetPort["Broadcast", "LHS"], "GlobLev2" -> "ReshapeL", "ReshapeL" -> NetPort["Broadcast", "RHS"], "Broadcast" -> "ColNet",
    "ColNet" -> NetPort["LossMS", "Input"], "Sigmoid" -> NetPort["LossMS", "Target"],  "ClassN" -> NetPort["LossCE", "Input"], 
   NetPort["Class"] -> NetPort["LossCE", "Target"], "LossCE" -> "timesLoss" }, 
  "Image" -> NetEncoder[{"Image", {224, 224}, "ColorSpace" -> "LAB", "Parallelize" -> False}] ]


tnet = NetTrain[
  classNet, (* the NetGraph defined above *)
  <|"Image" -> $trainPathsFile, "Class" -> $trainClasses|>,
  ValidationSet -> <|"Image" -> $testPathsFile, "Class" -> $testClasses|>,
  TargetDevice -> {"GPU", 1},
  "Method" -> "ADAM"
]

Evaluation Net

evalNet = Take[tnet, {"LowLev", "ColNet"}]
evalNet = NetChain[{evalNet}, "Input" -> NetEncoder[{"Image", {224, 224}, "ColorSpace" -> "Grayscale"}]];

Thanks for sharing, I was not really sure how to use the NetGraph, NetTrain, and NetChain, now it gives a bit more insight!

Posted 5 years ago

Thank you!!!

This is a good example article for people who are interested in this project:

Posted 5 years ago

This is very interesting work! I have the NetChain from the discussion above and it looks fine and seems to work with a small training set. Training on the "Places" dataset is a big task and not really practical with the resources I have.

So my question - and this is really a question to Sebastian who mentioned it earlier in the discussion - could the trained network be made available?

There is an upcoming model gallery where this, and many more models will be available.

@Sebastian Bodenstein was this colorization net ever published?

This one, no (for various reasons). A better one will be published very soon.

@Mike Sollami: There are two colorization nets available right now:

NetModel["ColorNet Image Colorization Trained on Places Data (Raw Model)"]


NetModel["ColorNet Image Colorization Trained on ImageNet Competition Data (Raw Model)"]

For usage, see here and here.

Are there any Wolfram Function Repository entries related to this kind of colorization?
