It has been a long time since I posted on Community. When I was converting this particular model (MobileNetV2) almost a year back, I had all the motivation to convert it, put it in the Neural Net Repository, and train it on facial features to continue making Snapchat-style filters for handheld devices. Somehow I lost the motivation and got busy with rather mundane grown-up activities. Recently, we had a team meeting where we discussed how we are looking for more user submissions and how we could encourage our very talented community members to convert models and submit them to the Wolfram Neural Net Repository (through the Contact Us button). This also gives me the opportunity to appreciate our long-time user Julian Francis, who has already converted 6 models (3 published and 3 in curation):
https://blog.wolfram.com/2018/12/06/deep-learning-and-computer-vision-converting-models-for-the-wolfram-neural-net-repository
Disclaimer
In this post we are going to discuss the non-automated way of converting models to the Wolfram Language. This approach is ideal for anyone who is starting to create their own models in the Wolfram Language, or who is trying to learn another framework (rather than just submitting and running code).
Step 1: Figure out the architecture
My aim was to convert MobileNetV2, so the first step was to thoroughly study the architecture of the model. While going through the paper gives a general overview, one needs to actually get one's hands dirty by closely examining the nitty-gritty details of the code. For example, it took me a whole week to figure out the details of the TensorFlow code:
mobilenet architecture
convolution blocks
Most of the time, using TensorBoard to visualize the graph helps one figure out the connections, while reading the code helps one figure out the details and the parameters.
Once you have reviewed the code in the framework you are converting from, you can start designing the architecture in the Wolfram Language.
Step 2: Coding it in Wolfram Language
If you closely examine the TensorFlow code, you will notice that there are repeating units, which we want to take full advantage of in our code as well.
As you will see in this case, there are mobile units that consist of a chain of ConvolutionLayer, BatchNormalizationLayer and a ReLU6 activation. There are three types of these units: one with symmetric padding, one with no activation (used in the linear bottlenecks), and one with asymmetric padding (used at the beginning of the network and wherever the network downsamples).
mobileunit[prefix_,nchannels_,kernel_,stride_,pad_,ngroup_,type_]:=
Which[
type==1,
NetGraph[
<|
"conv"<>prefix->ConvolutionLayer[nchannels,kernel,"Stride"->stride,"PaddingSize"-> pad,"ChannelGroups"-> ngroup],
"conv"<>prefix<>"_bn"-> BatchNormalizationLayer[],
"relu"<>prefix-> ElementwiseLayer[( Min[Max[0,#],6]&)]
|>,
{NetPort["Input"]-> 1-> 2-> 3}],
type==2,
NetGraph[
<|
"conv"<>prefix-> ConvolutionLayer[nchannels,kernel,"Stride"->stride,"PaddingSize"-> pad,"ChannelGroups"-> ngroup],
"conv"<>prefix<>"_bn"-> BatchNormalizationLayer[]
|>,
{NetPort["Input"]-> 1-> 2}],
type==3,
NetGraph[
<|
"conv"<>prefix-> ConvolutionLayer[nchannels,kernel,"Stride"->stride,"PaddingSize"-> {{Ceiling[kernel[[1]]/2]-2,Ceiling[kernel[[2]]/2]},{Ceiling[kernel[[1]]/2]-2,Ceiling[kernel[[2]]/2]}},"ChannelGroups"-> ngroup],
"conv"<>prefix<>"_bn"-> BatchNormalizationLayer[],
"relu"<>prefix-> ElementwiseLayer[( Min[Max[0,#],6]&)]
|>,
{NetPort["Input"]-> 1-> 2-> 3}]
]
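The ReLU6 activation used in these units simply clamps its input to the range [0, 6]; since there is no built-in ReLU6 layer, it is written as ElementwiseLayer[Min[Max[0, #], 6] &]. Here is a minimal NumPy sketch of the function itself (illustrative only, not part of the conversion code):

```python
import numpy as np

def relu6(x):
    # ReLU6: clamp values to the range [0, 6],
    # equivalent to Min[Max[0, #], 6]& in the ElementwiseLayer above
    return np.minimum(np.maximum(x, 0.0), 6.0)

print(relu6(np.array([-3.0, 0.5, 4.0, 9.0])))  # negatives -> 0, values above 6 -> 6
```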
Our next job is to put these units together into inverted residual units, which chain an expansion block, a depthwise block (a grouped convolution whose "ChannelGroups" equals its number of channels) and a linear block.
invresunit[prefix_,nchannels_,kernel_,stride_,pad_,ngroup_,type_]:=
Which[
type==1,
NetChain[{mobileunit[prefix<>"_expand",ngroup,1,1,0,1,1],
mobileunit[prefix<>"_dwise",ngroup,kernel,stride,pad,ngroup,1],
mobileunit[prefix<>"_linear",nchannels,1,1,0,1,2]}],
type==2,
NetChain[{mobileunit[prefix<>"_dwise",ngroup,kernel,stride,pad,ngroup,1],
mobileunit[prefix<>"_linear",nchannels,1,1,0,1,2]}],
type==3,
NetChain[{mobileunit[prefix<>"_expand",ngroup,1,1,0,1,1],
mobileunit[prefix<>"_dwise",ngroup,kernel,stride,pad,ngroup,3],
mobileunit[prefix<>"_linear",nchannels,1,1,0,1,2]}]
]
There are also mobilenetblock units, which are very similar to the invresunits except that they add a residual skip connection (a ThreadingLayer[Plus] that adds the block's input to its output):
mobilenetblock[prefix_,nchannels_,kernel_,stride_,pad_,ngroup_]:=
NetGraph[{mobileunit[prefix<>"_expand", ngroup, 1, 1, 0, 1, 1],
mobileunit[prefix<>"_dwise", ngroup, kernel, stride, pad, ngroup, 1],
mobileunit[prefix<>"_linear", nchannels, 1, 1, 0, 1, 2],
ThreadingLayer[Plus]},
{1->2->3->4, NetPort["Input"]->4}]
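The skip connection adds the block's untouched input to the output of the chain of units; schematically (a hypothetical NumPy sketch of the dataflow, not the actual Wolfram Language evaluation):

```python
import numpy as np

def with_skip(block, x):
    # Residual connection: output = block(x) + x,
    # mirroring the ThreadingLayer[Plus] fed by NetPort["Input"] above
    return block(x) + x

x = np.ones(4)
print(with_skip(lambda t: 2.0 * t, x))  # each element: 2*1 + 1 = 3
```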
Finally, we put the blocks together to create the mobilenet:
genmobilenet[c1_, c2_, c3_, c4_, c5_, c6_, c7_, c8_, c9_, g1_, g2_,
g3_, g4_, g5_, g6_, g7_, p_, dim_] :=
NetChain[<|
"1" -> mobileunit["1", c1, {3, 3}, {2, 2}, {0, 0}, 1, 3],
"2_1" -> invresunit["2_1", c2, {3, 3}, {1, 1}, {1, 1}, g1, 2],
"2_2" -> invresunit["2_2", c3, {3, 3}, {2, 2}, {0, 0}, g2, 3],
"3_1" -> mobilenetblock["3_1", c3, {3, 3}, {1, 1}, {1, 1}, g3],
"3_2" -> invresunit["3_2", c4, {3, 3}, {2, 2}, {0, 0}, g3, 3],
"4_1" -> mobilenetblock["4_1", c4, {3, 3}, {1, 1}, {1, 1}, g4],
"4_2" -> mobilenetblock["4_2", c4, {3, 3}, {1, 1}, {1, 1}, g4],
"4_3" -> invresunit["4_3", c5, {3, 3}, {2, 2}, {0, 0}, g4, 3],
"4_4" -> mobilenetblock["4_4", c5, {3, 3}, {1, 1}, {1, 1}, g5],
"4_5" -> mobilenetblock["4_5", c5, {3, 3}, {1, 1}, {1, 1}, g5],
"4_6" -> mobilenetblock["4_6", c5, {3, 3}, {1, 1}, {1, 1}, g5],
"4_7" -> invresunit["4_7", c6, {3, 3}, {1, 1}, {1, 1}, g5, 1],
"5_1" -> mobilenetblock["5_1", c6, {3, 3}, {1, 1}, {1, 1}, g6],
"5_2" -> mobilenetblock["5_2", c6, {3, 3}, {1, 1}, {1, 1}, g6],
"5_3" -> invresunit["5_3", c7, {3, 3}, {2, 2}, {0, 0}, g6, 3],
"6_1" -> mobilenetblock["6_1", c7, {3, 3}, {1, 1}, {1, 1}, g7],
"6_2" -> mobilenetblock["6_2", c7, {3, 3}, {1, 1}, {1, 1}, g7],
"6_3" -> invresunit["6_3", c8, {3, 3}, {1, 1}, {1, 1}, g7, 1],
"6_4" -> mobileunit["6_4", c9, {1, 1}, {1, 1}, {0, 0}, 1, 1],
"pool6" -> PoolingLayer[{p, p}, {1, 1}, "Function" -> Mean],
"fc7" -> ConvolutionLayer[1001, {1, 1}],
"reshape" -> ReshapeLayer[{1001}],
"prob_softmax" -> SoftmaxLayer[]
|>,
"Input" -> {3, dim, dim}]
Using a depth multiplier of 1.4, the channel counts are:
{c1,c2,c3,c4,c5,c6,c7,c8,c9} = {48, 24, 32, 48, 88, 136, 224, 448, 1792}
And for an input size of 224, the value of p is 7.
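The value p = 7 comes from the network's overall stride: the stem and four of the inverted residual units use stride 2, i.e. five halvings in total, so a 224×224 input leaves a 7×7 feature map for the mean pooling. A quick sketch of this arithmetic:

```python
def final_feature_size(input_dim, stride2_stages=5):
    # Each stride-2 stage halves the spatial resolution;
    # MobileNetV2 has five such stages (overall stride 32)
    size = input_dim
    for _ in range(stride2_stages):
        size //= 2
    return size

print(final_feature_size(224))  # -> 7, the pooling size p
```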
Step 3: Importing the Weights
Once the architecture is built, the next step is to get the pre-trained weights of the model. Thanks to ExternalEvaluate, this can now easily be done from within the Wolfram Language. Note that you need to have Python, NumPy, TensorFlow and all the other libraries that the model depends on installed in the correct path.
session = StartExternalSession["Python-NumPy"]
weights = ExternalEvaluate[session, "import tensorflow as tf
import numpy as np
import h5py
import sys
sys.path.append('/home/tuseetab/models/research/slim')
from nets.mobilenet import mobilenet_v2
height=224
width=224
channels=3
slim=tf.contrib.slim
X=tf.placeholder(tf.float32,shape=[None,height,width,channels])
with slim.arg_scope(mobilenet_v2.training_scope(is_training=False)):
    logits,end_points=mobilenet_v2.mobilenet(X,num_classes=1001)
# checkpoint_dir should point to the folder containing the downloaded checkpoint
with tf.Session() as sess:
    saver=tf.train.Saver()
    saver.restore(sess,checkpoint_dir+'mobilenet_v2_1.4_224.ckpt')
    var = tf.global_variables()
    weights = sess.run(var)
my_dict = {}
for i in range(len(var)):
    my_dict[var[i].name] = weights[i]
my_dict"]
DeleteObject[session]
Step 4: Parsing the Weights
The weights then need to be manually parsed. In this case, the layers containing parameters are ConvolutionLayer (weights) and BatchNormalizationLayer (moving means, moving variances, biases and scalings). While the BatchNormalization parameters are quite simple to parse (they are mostly vectors), the convolution weight arrays often need to be transposed (different frameworks store them in different orders). Please see the following code for more clarification.
convw =
Flatten[{{Transpose[Normal@weights[[1]][[1, 2]], {4, 3, 2, 1}],
Transpose[Normal@weights[[1]][[4, 1, 2]], {4, 3, 1, 2}],
Transpose[Normal@weights[[1]][[4, 2, 2]], {4, 3, 2, 1}],
Transpose[Normal@weights[[1]][[5, 2, 2]], {4, 3, 2, 1}],
Transpose[Normal@weights[[1]][[5, 1, 2]], {4, 3, 1, 2}],
Transpose[Normal@weights[[1]][[5, 3, 2]], {4, 3, 2, 1}]},
Flatten[
Table[{Transpose[Normal@weights[[1]][[i, 2, 2]], {4, 3, 2, 1}],
Transpose[Normal@weights[[1]][[i, 1, 2]], {4, 3, 1, 2}],
Transpose[Normal@weights[[1]][[i, 3, 2]], {4, 3, 2, 1}]},
{i, 13, Length@weights[[1]]}], 1],
Flatten[
Table[
{Transpose[Normal@weights[[1]][[i, 2, 2]], {4, 3, 2, 1}],
Transpose[Normal@weights[[1]][[i, 1, 2]], {4, 3, 1, 2}],
Transpose[Normal@weights[[1]][[i, 3, 2]], {4, 3, 2, 1}]},
{i, 6, 12}], 1],
{Transpose[Normal@weights[[1]][[2, 2]], {4, 3, 2, 1}]}}, 1];
bnbeta = Flatten[{{weights[[1]][[1, 1, 1]]},
Table[weights[[1]][[4, i, 1, 1]], {i, 2}],
Table[weights[[1]][[5, i, 1, 1]], {i, {2, 1, 3}}],
Flatten[
Table[weights[[1]][[i, k, 1, 1]],
{i, 13, Length@weights[[1]]}, {k, {2, 1, 3}}]
, 1],
Flatten[
Table[weights[[1]][[i, k, 1, 1]], {i, 6, 12}, {k, {2, 1, 3}}]
, 1],
{weights[[1]][[2, 1, 1]]}}
, 1];
bngamma = Flatten[{{weights[[1]][[1, 1, 2]]},
Table[weights[[1]][[4, i, 1, 2]], {i, 2}],
Table[weights[[1]][[5, i, 1, 2]], {i, {2, 1, 3}}],
Flatten[
Table[weights[[1]][[i, k, 1, 2]],
{i, 13, Length@weights[[1]]}, {k, {2, 1, 3}}]
, 1],
Flatten[
Table[weights[[1]][[i, k, 1, 2]], {i, 6, 12}, {k, {2, 1, 3}}]
, 1],
{weights[[1]][[2, 1, 2]]}}
, 1];
movmean = Flatten[{{weights[[1]][[1, 1, 3]]},
Table[weights[[1]][[4, i, 1, 3]], {i, 2}],
Table[weights[[1]][[5, i, 1, 3]], {i, {2, 1, 3}}],
Flatten[
Table[weights[[1]][[i, k, 1, 3]],
{i, 13, Length@weights[[1]]}, {k, {2, 1, 3}}]
, 1],
Flatten[
Table[weights[[1]][[i, k, 1, 3]], {i, 6, 12}, {k, {2, 1, 3}}]
, 1],
{weights[[1]][[2, 1, 3]]}}
, 1];
movvar = Flatten[{{weights[[1]][[1, 1, 4]]},
Table[weights[[1]][[4, i, 1, 4]], {i, 2}],
Table[weights[[1]][[5, i, 1, 4]], {i, {2, 1, 3}}],
Flatten[
Table[weights[[1]][[i, k, 1, 4]],
{i, 13, Length@weights[[1]]}, {k, {2, 1, 3}}]
, 1],
Flatten[
Table[weights[[1]][[i, k, 1, 4]]
, {i, 6, 12}, {k, {2, 1, 3}}]
, 1],
{weights[[1]][[2, 1, 4]]}}
, 1];
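For reference, the permutations above convert between storage layouts. TensorFlow stores regular convolution kernels as (height, width, input channels, output channels), while the Wolfram Language ConvolutionLayer expects (output channels, input channels, height, width); depthwise kernels, stored as (height, width, channels, multiplier), need a different permutation (the {4, 3, 1, 2} above). A NumPy sketch of the regular case (illustrative shapes only):

```python
import numpy as np

# TensorFlow layout: height x width x input channels x output channels
w_tf = np.zeros((3, 3, 32, 64))

# Wolfram Language layout: output channels x input channels x height x width
w_wl = np.transpose(w_tf, (3, 2, 0, 1))
print(w_wl.shape)  # -> (64, 32, 3, 3)
```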
Step 5: Linking the Weights
Yes, the next piece of code is quite tedious and error-prone, and it required the most iterations to get right. Nevertheless, this is the final code:
pref = {"2_2", "3_1", "3_2", "4_1", "4_2", "4_3", "4_4", "4_5", "4_6",
"4_7", "5_1", "5_2", "5_3", "6_1", "6_2", "6_3"};
pref2 = {"_expand", "_dwise", "_linear"};
pref3 = {"_expand_bn", "_dwise_bn", "_linear_bn"};
mobilenet2 = NetReplacePart[mobilenet,
Flatten[
Join[
Thread[
Flatten[
Table[
{i,
Which[k == "_expand", 1, k == "_dwise", 2, k == "_linear",
3],
"conv" <> i <> k,
"Weights"},
{i, pref}, {k, pref2}]
, 1]
-> Table[convw[[i]], {i, 4, 51}]],
Thread[
Flatten[
Table[
{
{i,
Which[k == "_expand_bn", 1, k == "_dwise_bn", 2,
k == "_linear_bn", 3],
"conv" <> i <> k,
"Biases"},
{i,
Which[k == "_expand_bn", 1, k == "_dwise_bn", 2,
k == "_linear_bn", 3],
"conv" <> i <> k,
"Scaling"},
{i,
Which[k == "_expand_bn", 1, k == "_dwise_bn", 2,
k == "_linear_bn", 3],
"conv" <> i <> k,
"MovingMean"},
{i,
Which[k == "_expand_bn", 1, k == "_dwise_bn", 2,
k == "_linear_bn", 3],
"conv" <> i <> k,
"MovingVariance"}}
, {i, pref}, {k, pref3}]
, 2]
-> Flatten[
Table[{bnbeta[[i]], bngamma[[i]], movmean[[i]],
movvar[[i]]}, {i, 4, 51}]
, 1]],
{{"1", "conv1", "Weights"} -> convw[[1]],
{"1", "conv1_bn", "Biases"} -> bnbeta[[1]],
{"1", "conv1_bn", "Scaling"} -> bngamma[[1]],
{"1", "conv1_bn", "MovingMean"} -> movmean[[1]],
{"1", "conv1_bn", "MovingVariance"} -> movvar[[1]],
{"2_1", 1, "conv2_1_dwise", "Weights"} -> convw[[2]],
{"2_1", 1, "conv2_1_dwise_bn", "Biases"} -> bnbeta[[2]],
{"2_1", 1, "conv2_1_dwise_bn", "Scaling"} -> bngamma[[2]],
{"2_1", 1, "conv2_1_dwise_bn", "MovingMean"} ->
movmean[[2]],
{"2_1", 1, "conv2_1_dwise_bn", "MovingVariance"} ->
movvar[[2]],
{"2_1", 2, "conv2_1_linear", "Weights"} -> convw[[3]],
{"2_1", 2, "conv2_1_linear_bn", "Biases"} ->
bnbeta[[3]],
{"2_1", 2, "conv2_1_linear_bn", "Scaling"} ->
bngamma[[3]],
{"2_1", 2, "conv2_1_linear_bn", "MovingMean"} ->
movmean[[3]],
{"2_1", 2, "conv2_1_linear_bn", "MovingVariance"} ->
movvar[[3]],
{"6_4", "conv6_4", "Weights"} -> convw[[52]],
{"6_4", "conv6_4_bn", "Biases"} -> bnbeta[[52]],
{"6_4", "conv6_4_bn", "Scaling"} -> bngamma[[52]],
{"6_4", "conv6_4_bn", "MovingMean"} -> movmean[[52]],
{"6_4", "conv6_4_bn", "MovingVariance"} -> movvar[[52]],
{"fc7", "Weights"} ->
Transpose[Normal@weights[[1]][[3, 1, 2]], {4, 3, 2, 1}],
{"fc7", "Biases"} -> Normal@weights[[1]][[3, 1, 1]]}
], 1]];
Step 6: Making the tests
To test the conversion, we evaluate the model in TensorFlow (the original framework) on a zero input and on a random input, and export the results as a structured HDF5 file. We then evaluate the Wolfram Language model on a zero input and on the exact same random input, and compute the difference between the two sets of outputs.
A snippet of the code that computes the difference is below:
randomInput = Transpose[Normal@tests["RandomInput"], {3, 2, 1}];
fileOutZero = Normal@tests["OutputForZeros"][[All, 1]];
fileOutRandom = Normal@tests["OutputForRandom"][[All, 1]];
netOutZero =
Normal@bareNet [ ConstantArray[0, Dimensions@randomInput]];
netOutRandom = Normal@bareNet [randomInput];
diffZero = Abs[netOutZero - fileOutZero];
diffRandom = Abs[netOutRandom - fileOutRandom];
zeroTest = <|
"MaxAbsoluteDifference" -> Max[diffZero],
"MaxRelativeDifference" ->
Max[diffZero / Clip[Abs@netOutZero, {10^-8., Infinity}]]
|>;
randomTest = <|
"MaxAbsoluteDifference" -> Max[diffRandom],
"MaxRelativeDifference" ->
Max[diffRandom / Clip[Abs@netOutRandom, {10^-8., Infinity}]]
|>;
where tests contains the data imported from the HDF5 file produced by TensorFlow:
a) Output obtained by running with zeros
b) Random Input
c) Output obtained by running the random input
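The same comparison logic can be sketched in NumPy with hypothetical output vectors (the clip at 10^-8 guards against division by zero in the relative difference):

```python
import numpy as np

def max_differences(net_out, ref_out, eps=1e-8):
    # Maximum absolute and relative differences between two output vectors
    diff = np.abs(net_out - ref_out)
    rel = diff / np.clip(np.abs(net_out), eps, None)
    return diff.max(), rel.max()

net_out = np.array([0.1, 0.7, 0.2])        # hypothetical Wolfram Language output
ref_out = np.array([0.1, 0.70001, 0.19999])  # hypothetical TensorFlow output
abs_d, rel_d = max_differences(net_out, ref_out)
print(abs_d, rel_d)
```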
Step 7: Attach Encoder and Decoder
mean = {0.485, 0.456, 0.406};
variance = {0.052441, 0.050176, 0.050625};
classes =
Prepend[Import @ FileNameJoin[{$CommonDir, "imagenet1000.m"}],
Entity["Concept", "Other::nzvm6"]];
dec = NetDecoder[{"Class", classes}];
net = NetReplacePart[mobilenet,
{"Input" -> NetEncoder[{"Image", {224, 224}}],
"Output" -> dec}]
Here the classes variable imports a file that contains all 1000 ImageNet classes stored as Entities, and prepends an extra "Other" class (the background class, matching the 1001 outputs of the network).
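As a side note on where the mean and variance values come from: these are the standard ImageNet channel statistics. The means are used directly, and the variance values are simply the squares of the commonly quoted channel standard deviations {0.229, 0.224, 0.225}:

```python
# ImageNet per-channel standard deviations (R, G, B)
std = [0.229, 0.224, 0.225]

# The variances listed above are just the squares of these
variance = [round(s * s, 6) for s in std]
print(variance)  # -> [0.052441, 0.050176, 0.050625]
```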
Conclusion:
After reading this post, one might well think that it is easier, better and less time-consuming to opt for automated tools for model conversion. Yes, when they are available, the tools are definitely the way to go (in fact, we have a big drive in the group to develop and use the ONNX converter, although it is not yet fully tested and ready). However, many times there are models in the wild that are harder, and sometimes impossible, to convert using these tools. In those cases, you need to use the traditional hand-conversion method. Although hand conversion is often laborious, time-consuming and, for lack of a better word, frustrating, the process teaches us a lot about the various frameworks and their differences and about the details of the models, and it is essential if you wish to pursue research in this field in the future.