Group Abstract

Message Boards

WOLFRAM COMMUNITY

17.5K Views

4 Replies

11 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Staff Picks Computer Science Image Processing Mathematics External Programs and Systems Wolfram Language Statistics and Probability Machine Learning Know-How Neural Networks

[Mathematica-vs-R] Handwritten digits recognition by matrix factorization

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 9 years ago

Introduction This MathematicaVsR project is for comparing Mathematica and R for the tasks of classifier creation, execution, and evaluation using the MNIST database of images of handwritten digits. Here are the bases built with two different classifiers: Singular Value Decomposition (SVD) Non-Negative Matrix Factorization (NNMF) Here are the confusion matrices of the two classifiers: SVD (total accuracy: 0.957) NNMF (total accuracy: 0.9663) The blog post "Classification of handwritten digits" (published 2013) has a related more elaborated discussion over a much smaller database of handwritten digits. Concrete steps The concrete steps taken in scripts and documents of this project follow. Ingest the binary data files into arrays that can be visualized as digit images. We have two sets: 60,000 training images and 10,000 testing images. Make a linear vector space representation of the images by simple unfolding. For each digit find the corresponding representation matrix and factorize it. Store the matrix factorization results in a suitable data structure. (These results comprise the classifier training.) One of the matrix factors is seen as a new basis. For a given test image (and its linear vector space representation) find the basis that approximates it best. The corresponding digit is the classifier prediction for the given test image. Evaluate the classifier(s) over all test images and compute accuracy, F-Scores, and other measures. Scripts There are scripts going through the steps listed above: Mathematica : "./Mathematica/Handwritten-digits-classification-by-matrix-factorization.md" R : "./R/HandwrittenDigitsClassificationByMatrixFactorization.Rmd". Documents The following documents give expositions that are suitable for reading and following of steps and corresponding results. Mathematica : "./Mathematica/Handwritten-digits-classification-by-matrix-factorization.pdf". R : "./R/HandwrittenDigitsClassificationByMatrixFactorization.pdf", "./R/HandwrittenDigitsClassificationByMatrixFactorization.html". Observations Ingestion I figured out first in R how to ingest the data in the binary files of the MNIST database. There were at least several online resources (blog posts, GitHub repositories) that discuss the MNIST binary files ingestion. After that making the corresponding code in Mathematica was easy. Classification results Same in Mathematica and R for for SVD and NNMF. (As expected.) NNMF NNMF classifiers use the MathematicaForPrediction at GitHub implementations: NonNegativeMatrixFactorization.m and NonNegativeMatrixFactorization.R. Parallel computations Both Mathematica and R have relatively simple set-up of parallel computations. Graphics It was not very straightforward to come up in R with visualizations for MNIST images. The Mathematica visualization is much more flexible when it comes to plot labeling. Going further Comparison with other classifiers Using Mathematica's built-in classifiers it was easy to compare the SVD and NNMF classifiers with neural network ones and others. (The SVD and NNMF are much faster to built and they bring comparable precision.) It would be nice to repeat that in R using one or several of the neural network classifiers provided by Google, Microsoft, H2O, Baidu, etc. Classifier ensembles Another possible extension is to use classifier ensembles and Receiver Operation Characteristic (ROC) to create better classifiers. (Both in Mathematica and R.) Importance of variables Using classifier agnostic importance of variables procedure we can figure out : which NNMF basis vectors (images) are most important for the classification precision, which image rows or columns are most important for each digit, or similarly which image squares of a, say, 4x4 image grid are most important.

POSTED BY: Anton Antonov

4 Replies

Sort By:

EDITORIAL BOARD

EDITORIAL BOARD, WOLFRAM

Posted 9 years ago

- another post of yours has been selected for the Staff Picks group, congratulations! We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!

POSTED BY: EDITORIAL BOARD

Daniel Lichtblau

Daniel Lichtblau, Wolfram Research

Posted 9 years ago

Anton, This is very nice. One remark is that I've had some success using the Fourier Sine or Cosine transform (I'll use FCT in referring to this below) as a preprocessor. The idea is to use it on the image matrix, then extract a submatrix of say the lowest 10x10 components and use only that in the SVD phase. This offers three advantages. (1) It improves speed. The FCT is fast enough that it is not a bottleneck, and by reducing size of the input to SVD that becomes much faster. (2) It cuts down the memory footprint. FCT does not really take much memory since it can be done one at a time. SVD is done on the entire training ensemble and on a large matrix it will thus require more memory relative to a smaller one. (3) The extra level of dimension reduction via FCT improves the outcome. Depending on fine tuning of parameter settings (how many FCT components, how many singular values, how many lookup values to use in a majority rules classifier) it can hit 97.8% correct recognition. Also if we look for the 5 nearest neighbors, my recollection is that nearly always at least one will be the correct digit. Details of all this will be in the SYNASC 2016 proceedings (final version was submitted yesterday, and I think it comes out in January or so). The code for the MNIST example is not very long so I'll include it below. Since the images are coarse (28x28) I also sharpen them. That turns out to be the biggest speed bottleneck actually (takes several minutes whereas the rest totals less than a minute). Also I use the fourth Fourier Sine transform. I can get similar results with the second Cosine transform, if memory serves correctly. keep = 31; dn = 10; dst = 4; Clear[nearestImages, processInput]; nearestImages[ilist_, vals_, dn_, dnum_, keep_] := Module[ {idata, images = ilist, dcts, top, topvecs, uu, ww, vv, udotw, norms}, idata = Map[ImageData, images]; dcts = Map[FourierDST[# - Mean[Flatten[#]], dnum] &, idata]; top = dcts[[All, 1 ;; dn, 1 ;; dn]]; topvecs = Map[Flatten, top]; topvecs = Map[# - Mean[#] &, topvecs]; {uu, ww, vv} = SingularValueDecomposition[topvecs, keep]; udotw = uu.ww; norms = Map[Sqrt[#.#] &, udotw]; udotw = udotw/norms; udotw = Join[udotw, Transpose[{Log[norms]}], 2]; {Nearest[udotw -> vals, Method -> "KDTree"], vv}] processInput[ilist_, vv_, dn_, dnum_] := Module[ {idata, images = ilist, dcts, top, topvecs, tdotv, norms}, idata = Map[ImageData, images]; dcts = Map[FourierDST[# - Mean[Flatten[#]], dnum] &, idata]; top = dcts[[All, 1 ;; dn, 1 ;; dn]]; topvecs = Map[Flatten, top]; topvecs = Map[# - Mean[#] &, topvecs]; tdotv = topvecs.vv; norms = Map[Sqrt[#.#] &, tdotv]; tdotv = tdotv/norms; tdotv = Join[tdotv, Transpose[{Log[norms]}], 2]; tdotv] trainingBytes = Import[ "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz" , "Byte"]; trainingImages = Map[Image[Partition[#, 28]] &, Partition[Drop[trainingBytes, 16], 28^2]]; testBytes = Import[ "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz" , "Byte"]; testImages = Map[Image[Partition[#, 28]] &, Partition[Drop[testBytes, 16], 28^2]]; trainingLabels = Drop[Import[ "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz", "Byte"], 8]; testLabels = Drop[Import[ "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz", "Byte"], 8]; func = Sharpen[#, 12] &; trainingImages = Map[func, trainingImages]; testImages = Map[func, testImages]; guesses[nf_, tvecs_, n_] := Module[ {nbrs, counts}, nbrs = Map[nf[#, n] &, tvecs]; counts = Map[Tally, nbrs]; counts = Map[Reverse, Map[SortBy[#[[2]] &], counts]]; counts[[All, 1, 1]]] correct[guess_, actual_] /; Length[guess] == Length[actual] := Count[guess - actual, 0] correct[__] := $Failed AbsoluteTiming[{nf, vv} = nearestImages[trainingImages, trainingLabels, dn, dst, keep]; testvecs = processInput[testImages, vv, dn, dst];] (* Out[121]= {4.248808, Null} ) AbsoluteTiming[ guessed = guesses[nf, testvecs, 5];] correct[guessed, testLabels] ( Out[122]= {13.071579, Null} Out[123]= 9781 *)

Anton,

This is very nice. One remark is that I've had some success using the Fourier Sine or Cosine transform (I'll use FCT in referring to this below) as a preprocessor. The idea is to use it on the image matrix, then extract a submatrix of say the lowest 10x10 components and use only that in the SVD phase. This offers three advantages.

(1) It improves speed. The FCT is fast enough that it is not a bottleneck, and by reducing size of the input to SVD that becomes much faster.

(2) It cuts down the memory footprint. FCT does not really take much memory since it can be done one at a time. SVD is done on the entire training ensemble and on a large matrix it will thus require more memory relative to a smaller one.

(3) The extra level of dimension reduction via FCT improves the outcome. Depending on fine tuning of parameter settings (how many FCT components, how many singular values, how many lookup values to use in a majority rules classifier) it can hit 97.8% correct recognition. Also if we look for the 5 nearest neighbors, my recollection is that nearly always at least one will be the correct digit.

Details of all this will be in the SYNASC 2016 proceedings (final version was submitted yesterday, and I think it comes out in January or so). The code for the MNIST example is not very long so I'll include it below. Since the images are coarse (28x28) I also sharpen them. That turns out to be the biggest speed bottleneck actually (takes several minutes whereas the rest totals less than a minute). Also I use the fourth Fourier Sine transform. I can get similar results with the second Cosine transform, if memory serves correctly.

keep = 31;
dn = 10;
dst = 4;
Clear[nearestImages, processInput];

nearestImages[ilist_, vals_, dn_, dnum_, keep_] :=
 Module[
  {idata, images = ilist, dcts, top,
   topvecs, uu, ww, vv, udotw, norms},
  idata = Map[ImageData, images];
  dcts = Map[FourierDST[# - Mean[Flatten[#]], dnum] &, idata];
  top = dcts[[All, 1 ;; dn, 1 ;; dn]];
  topvecs = Map[Flatten, top];
  topvecs = Map[# - Mean[#] &, topvecs];
  {uu, ww, vv} =
   SingularValueDecomposition[topvecs, keep];
  udotw = uu.ww;
  norms = Map[Sqrt[#.#] &, udotw];
  udotw = udotw/norms;
  udotw = Join[udotw, Transpose[{Log[norms]}], 2];
  {Nearest[udotw -> vals, Method -> "KDTree"], vv}]

processInput[ilist_, vv_, dn_, dnum_] :=
 Module[
  {idata, images = ilist, dcts, top,
   topvecs, tdotv, norms},
  idata = Map[ImageData, images];
  dcts = Map[FourierDST[# - Mean[Flatten[#]], dnum] &, idata];
  top = dcts[[All, 1 ;; dn, 1 ;; dn]];
  topvecs = Map[Flatten, top];
  topvecs = Map[# - Mean[#] &, topvecs];
  tdotv = topvecs.vv;
  norms = Map[Sqrt[#.#] &, tdotv];
  tdotv = tdotv/norms;
  tdotv = Join[tdotv, Transpose[{Log[norms]}], 2];
  tdotv]

trainingBytes = Import[
   "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz" , "Byte"];
trainingImages = Map[Image[Partition[#, 28]] &, Partition[Drop[trainingBytes, 16], 28^2]];
testBytes = Import[
   "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz" , "Byte"];
testImages = Map[Image[Partition[#, 28]] &, Partition[Drop[testBytes, 16], 28^2]];
trainingLabels = Drop[Import[
    "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz", "Byte"], 8];
testLabels = Drop[Import[
    "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz", "Byte"], 8];

func = Sharpen[#, 12] &;
trainingImages = Map[func, trainingImages];
testImages = Map[func, testImages];

guesses[nf_, tvecs_, n_] := Module[
  {nbrs, counts},
  nbrs = Map[nf[#, n] &, tvecs];
  counts = Map[Tally, nbrs];
  counts =
   Map[Reverse, Map[SortBy[#[[2]] &], counts]];
  counts[[All, 1, 1]]]

correct[guess_, actual_] /;
  Length[guess] == Length[actual] :=
  Count[guess - actual, 0]
correct[__] := $Failed

AbsoluteTiming[{nf, vv} =
  nearestImages[trainingImages,
   trainingLabels, dn, dst, keep];
 testvecs =
  processInput[testImages, vv, dn, dst];]

(* Out[121]= {4.248808, Null} *)

AbsoluteTiming[
 guessed = guesses[nf, testvecs, 5];]
correct[guessed, testLabels]

(* Out[122]= {13.071579, Null}

Out[123]= 9781 *)

POSTED BY: Daniel Lichtblau

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 9 years ago

Thank you for your code, Dan! And sorry for my delayed reply -- I was too busy analyzing Trump tweets. Some comments follow. 1. I used NNs based image classifiers for couple of my consultancy projects. In my opinion, the use of NNs methods is underestimated in image classification. 2. Using the function `CrossTabulate` I visualized the confusion matrices for several experiments using different values for `keep`. (The variable `keep` is not defined in your code.) Pretty good overall accuracy is achieved with `keep = 40` and reasonably fast. 3. Instead of using `Sharpen` I was thinking to do the opposite -- use Gaussian blur. 4. It might be a good idea to look in ROC application to your approach. (I might do that later this week.) Here is experimental code for over a set of `keep` values: Table[( trainingTime = AbsoluteTiming[ {nf, vv} = nearestImages[trainingImages, trainingLabels, dn, dst, keep]; testvecs = processInput[testImages, vv, dn, dst]; ][[1]]; classificationTime = AbsoluteTiming[guessed = guesses[nf, testvecs, 5];][[1]]; ColumnForm[{ Row[{"keep = ", keep}], Row[{"Training time, (s): ", trainingTime}], Row[{"Classification time, (s): ", classificationTime}], Row[{"Overall accuracy: ", correct[guessed, testLabels]/Length[testLabels] // N}], MatrixForm@CrossTabulate[Transpose[{guessed, testLabels}]] }] ), {keep, {5, 10, 15, 20, 30, 40, 60, 100}}] And here is the output: Note that increasing `keep` after `30` does not produce significantly better results.

POSTED BY: Anton Antonov

Daniel Lichtblau

Daniel Lichtblau, Wolfram Research

Posted 9 years ago

Anton, I had `keep=31;`. Once upon a time I had also tried different preprocessing, including blurring. My best result came with sharpening but I cannot say I tried all variations at the same settings and maybe blurring is better for the MNIST suite. It might also be the case that one should use regular, sharpened, and blurred versions in the training stage. Your nice charts are giving me a case of "output envy".

POSTED BY: Daniel Lichtblau

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback

[Mathematica-vs-R] Handwritten digits recognition by matrix factorization

Introduction

Concrete steps

Scripts

Documents

Observations

Ingestion

Classification results

NNMF

Parallel computations

Graphics

Going further

Comparison with other classifiers

Classifier ensembles

Importance of variables