Introduction
Disclaimer: I am an ARM employee, but this is a personal project.
I wanted to understand how to convert a neural network from Mathematica so that it can be used on a Cortex-M with the CMSIS-NN library. CMSIS-NN is a free ARM library containing a few optimized functions for neural networks on embedded systems (convolutional and fully connected layers).
There are a few demos (CIFAR and keyword spotting) running on Cortex-M. They were generated either from the Caffe framework or with TensorFlow Lite.
I wanted to do the same from Mathematica and also understand the CMSIS-NN library a bit better. So, I attempted to reproduce a keyword spotting example.
The Network
The network is quite simple, but for an embedded system we cannot use anything too complex anyway. First, the audio goes through an MFCC step:
audioEnc =
NetEncoder[{"AudioMFCC", "WindowSize" -> 4*160, "Offset" -> 2*160,
"NumberOfCoefficients" -> 10, "TargetLength" -> 49,
"SampleRate" -> 16000}]
The network is standard: a few convolutional layers followed by a few fully connected layers:
kwsModel = NetChain[{
ReplicateLayer[1],
ConvolutionLayer[channels[[1]], {10, 4}],
ElementwiseLayer[Ramp],
ConvolutionLayer[channels[[2]], {10, 4},
"Input" -> {channels[[1]], 40, 7}, "Stride" -> {2, 1}],
ElementwiseLayer[Ramp],
LinearLayer[58],
ElementwiseLayer[Ramp],
LinearLayer[128],
ElementwiseLayer[Ramp],
LinearLayer[Length[wantedClass]],
SoftmaxLayer[]
}, "Input" -> audioEnc, "Output" -> classDec]
At the output, a NetDecoder converts the output into 3 classes: I am trying to detect the words "backward", "yes", and "no".
The test patterns come from the TensorFlow keyword spotting example (link in the attached notebook), but my network is different.
Problems to solve
There are two problems to solve before this network can be converted for CMSIS-NN.
First problem: the library uses a different convention for the tensors, which means that the weights have to be reordered before being used by CMSIS-NN. Since this is not too difficult to do in Mathematica, I won't detail it here.
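To make the reordering concrete: Mathematica stores convolution weights channel-first (CHW), while CMSIS-NN expects channel-last (HWC). A minimal C sketch of what the reordering amounts to (the function and index names are my own, not part of either library):

```c
#include <stddef.h>

/* Reorder one tensor from channel-first (CHW) layout to the
   channel-last (HWC) layout expected by CMSIS-NN. */
void chw_to_hwc(const float *chw, float *hwc, size_t c, size_t h, size_t w)
{
    for (size_t ci = 0; ci < c; ci++)
        for (size_t hi = 0; hi < h; hi++)
            for (size_t wi = 0; wi < w; wi++)
                hwc[(hi * w + wi) * c + ci] = chw[(ci * h + hi) * w + wi];
}
```

In the notebook this is done with a Transpose on the weight arrays before they are written out as C arrays.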
Second problem: CMSIS-NN uses fixed-point arithmetic (Q15 or Q7), whereas Mathematica uses floats. This exposes two limitations of Mathematica. First, there are no quantization operators, so we cannot train the network with the quantization effects. This is not a major issue, but we can expect the quantized network to perform worse than if we had trained it with quantization effects from the start.
Second, during training Mathematica does not keep track of the statistics of the intermediate values (the inputs and outputs of the layers). But to convert the floats into fixed point, we need to know the dynamic range of those values. Once the dynamic range is known, the quantization is controlled through parameters of the CMSIS-NN layers: shift values for the weights and biases.
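To sketch what those shift parameters do: a CMSIS-NN layer call takes a bias_shift (applied left to the bias before accumulation) and an out_shift (applied right to the accumulator before saturation). Once each tensor's number of fractional bits has been chosen from its dynamic range, both shifts follow from aligning everything to the accumulator format in_frac + wt_frac. The struct and names below are my own illustration of that convention, not the CMSIS-NN API:

```c
/* Shift parameters for one CMSIS-NN layer, derived from the
   fractional-bit counts chosen for each tensor. */
typedef struct {
    int bias_shift;  /* left-shift applied to the bias         */
    int out_shift;   /* right-shift applied to the accumulator */
} layer_shifts;

layer_shifts compute_shifts(int in_frac, int wt_frac, int bias_frac, int out_frac)
{
    layer_shifts s;
    /* The product of a Qx.in_frac input and a Qx.wt_frac weight has
       in_frac + wt_frac fractional bits; the bias is shifted up to match,
       and the result is shifted down to the chosen output format. */
    s.bias_shift = (in_frac + wt_frac) - bias_frac;
    s.out_shift  = (in_frac + wt_frac) - out_frac;
    return s;
}
```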
So, to get those statistics, I simply apply each layer of the trained network one after another and keep track of the inputs and outputs. I do this on all the training patterns. Fortunately, embedded networks are small, so even though this is slow, it is not too slow.
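The bookkeeping itself is simple. In the notebook it is done in Mathematica, but as a sketch, a running min/max tracker for a layer's observed values could look like this in C:

```c
#include <math.h>
#include <stddef.h>

/* Running min/max statistics for one layer's inputs or outputs. */
typedef struct { float min, max; } value_stats;

void stats_init(value_stats *s)
{
    s->min =  INFINITY;
    s->max = -INFINITY;
}

/* Fold one batch of observed values into the running statistics. */
void stats_update(value_stats *s, const float *values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (values[i] < s->min) s->min = values[i];
        if (values[i] > s->max) s->max = values[i];
    }
}
```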
I get beautiful histograms (log scale) which are used as a basis to choose how to quantize the values. A simple strategy is to just use the min and max values.
Code generation
Once I have statistics for the dynamic range of the values, I can generate C code for CMSIS-NN and C arrays containing the quantized values.
Since quantization affects the performance of the network, I want to be able to test the result easily. So, I have customized CMSIS-NN so that it can be run from Mathematica: the C code generated by the notebook can be compiled and used with MathLink.
This way, I can compare the behavior of the original network with that of the CMSIS-NN quantized one.
Here is an example:
To use the notebook
The steps to convert a network are:
- Train a network
- Compute statistics on the network's intermediate values
netStats = ComputeAllFiles[result, audioEnc, trainingFiles, SumStat];
result is the trained network.
audioEnc is the MFCC encoder.
trainingFiles are the training files.
SumStat is the strategy used for the statistics. Here we just compute summary statistics: min/max.
- Quantize the network and generate C code
mfcc = audioEnc[AudioResample[SpeechSynthesize["backward"], 16000]];
quantizedNetwork1 = CorrectedFormats[result, netStats, 15, 0];
quantizedNetwork = <| "w" -> quantizedNetwork1["w"],
"net" -> Drop[quantizedNetwork1["net"], 1]|>;
TestPatterns[NetDrop[result, 1],
NetExtract[result, 1][mfcc], quantizedNetwork];
CompileNetwork[
NetDrop[result, 1], NetExtract[result, 1][mfcc],
result[mfcc, None], quantizedNetwork]
TestCode[NetDrop[result, 1], quantizedNetwork];
In this example, NetDrop is just dropping the first ReplicateLayer. This layer does not exist in CMSIS-NN; it is used here only to adapt the tensor shape at the input of the network.
The second and third arguments of CompileNetwork are the input and output of the network on one test pattern. They are used only when debugging the network.
mfcc is the input pattern (the MFCC of some audio).
- Compile the generated code in ctests using the provided Makefile
- Link the executable and start using it
link = Install[
FileNameJoin[{NotebookDirectory[], "ctests", "cmsisnn.exe"}]];
cmsiskws[s_] :=
classDec[CMSISNN[
QuantizeData[15, quantizedNetwork["net"][[1, 2]],
Transpose[audioEnc[s] // ReplicateLayer[1], {3, 2, 1}] //
Flatten]]];
cmsiskws is a convenience function. It computes the MFCC of the audio, reorders the data (Transpose) to match the CMSIS-NN convention, and quantizes the result using the format computed during the quantization of the network.
Then the CMSISNN function is called on the result.
We can now test that this C code recognizes the word "yes":
cmsiskws[AudioResample[SpeechSynthesize["yes"], 16000]]
The same notebook and the same principles were used on the CIFAR example.
I can't attach the zip containing the C sources to this post; it is not accepted. Without those C sources you won't be able to reproduce the results of this post.
Any idea how I could share them?