
Classifying Japanese characters from the Edo period

POSTED BY: Marco Thiel
10 Replies

I should ask: are you (Marco) certain it is fair game to put the test set in as the ValidationSet option setting? It would seem from the documentation that that will cause parameter values to be optimized for the known results, which in turn might lead to a score that is artificially inflated. But maybe I misunderstand what that option really does.

POSTED BY: Daniel Lichtblau

Dear Daniel,

You are quite right. I wondered about that myself, but saw it done that way in one of the presentations by one of your colleagues at the WTC. I have tried the same thing without that option and got very similar results. It would be great to get the opinion of one of your developers.

I just ran it again (without the option)

lenet = NetTrain[lenet, trainingset, MaxTrainingRounds -> 20];

and got:

cm = ClassifierMeasurements[lenet, testset]
cm["Accuracy"]

0.958452, so nearly 96%. I would love to get some feedback about the ValidationSet option. Thank you for bringing this up.

Best wishes and thanks,

Marco

PS: From the documentation: "ValidationSet->data is typically used when the data in the training set and the data that one wishes to predict or classify come from different sources." That suggests that I should not use the option; but I do not fully understand that scenario. Luckily, in this case it does not make any difference to the main result.
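
For what it is worth, a cleaner pattern would be to hold out part of the training data as the validation set, so that the test set never influences training at all. A minimal sketch, assuming trainingset is the list of labelled examples used above (the 10% split is an arbitrary choice, not something I have tuned):

(* hold out 10% of the training data as a validation set, keeping the test set unseen *)
valsize = Round[0.1 Length[trainingset]];
{valset, trainsub} = TakeDrop[RandomSample[trainingset], valsize];
lenet = NetTrain[lenet, trainsub, ValidationSet -> valset, MaxTrainingRounds -> 20];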

POSTED BY: Marco Thiel

Here is a method that is a bit like using Nearest on images (in that the internal code for that has some similarities). I crib from code located here. The idea, quite abbreviated, is to take a sub-array of low-frequency Fourier components (I use a DCT for this), flatten these arrays into vectors, extract a singular value decomposition keeping some number of singular values, and use the result to (1) preprocess and (2) look up the test images. We find several "nearest" ones and assign a score by weighting by the inverse of the proximity to the lookup vector.

nearestImages[ilist_, vals_, dn_, dnum_, keep_] :=
 Module[{images = ilist, dcts, top, topvecs, uu, ww, vv, udotw, norms},
  (* mean-center each image and take its discrete cosine transform (type dnum) *)
  dcts = Map[FourierDCT[# - Mean[Flatten[#]], dnum] &, images];
  (* keep only the dn x dn block of low-frequency components and flatten to vectors *)
  top = dcts[[All, 1 ;; dn, 1 ;; dn]];
  topvecs = Map[Flatten, top];
  (* reduce dimension with a truncated SVD, then normalize the rows *)
  {uu, ww, vv} = SingularValueDecomposition[topvecs, keep];
  udotw = uu.ww;
  norms = Map[Sqrt[#.#] &, udotw];
  udotw = udotw/norms;
  (* return a NearestFunction over {vector, label} pairs, plus vv for preprocessing test data *)
  {Nearest[udotw -> Transpose[{udotw, vals}]], vv}]

processInput[ilist_, vv_, dn_, dnum_] :=
 Module[{images = ilist, dcts, top, topvecs, tdotv, norms},
  (* same DCT preprocessing as in nearestImages *)
  dcts = Map[FourierDCT[# - Mean[Flatten[#]], dnum] &, images];
  top = dcts[[All, 1 ;; dn, 1 ;; dn]];
  topvecs = Map[Flatten, top];
  (* project onto the retained right singular vectors and normalize the rows *)
  tdotv = topvecs.vv;
  norms = Map[Sqrt[#.#] &, tdotv];
  tdotv = tdotv/norms;
  tdotv]

guesses[nf_, tvecs_, n_] :=
 Module[{probs, probsB, bestvals},
  (* for each test vector, find the n nearest neighbors and weight their
     labels by the inverse of the distance to the test vector *)
  probs = Table[
    Module[{res = nf[tvecs[[j]], n], dists},
     dists = 1/Map[Norm[tvecs[[j]] - #, 3/2] &, res[[All, 1]]];
     Thread[{res[[All, 2]], dists/Total[dists]}]],
    {j, Length[tvecs]}];
  (* sum the weights per label... *)
  probsB = Map[Normal[GroupBy[#, First]] &, probs] /.
    (val_ -> vlist : {{val_, _} ..}) :> (val -> Total[vlist[[All, 2]]]);
  (* ...and lay them out as a length-10 score vector per test image (labels 0-9) *)
  probs = (Range[0, 9] /. probsB) /. Thread[Range[0, 9] -> 0];
  (* the guess is the label with the largest score *)
  bestvals = Map[First[Ordering[#, 1, Greater]] &, probs, {1}] - 1;
  bestvals]

(* count how many guessed labels agree with the actual labels *)
correct[guess_, actual_] /; Length[guess] == Length[actual] :=
 Count[guess - actual, 0]
correct[__] := $Failed

The example proceeds from the point of having imported the arrays into characterImages, as in the original post. We separate into training and test image arrays and label sets.

trainImages = characterImages[[3, All]]/256.;
trainLabels = Flatten[characterImages[[4, All]]];
testImages = characterImages[[1, All]]/256.;
testLabels = Flatten[characterImages[[2, All]]];
tlen = Length[testLabels]; (* number of test images, used for the accuracy below *)

The method has some tuning parameters. The values used below are in the general vicinity of what is used in the tests at the link given above. We use four neighbors, although some experiments indicate that three might be a better choice for this particular data set. Total run time is a few seconds.

keep = 40;
dn = 20;
dst = 4;
AbsoluteTiming[{nf, vv} = nearestImages[trainImages, trainLabels, dn, dst, keep];]
AbsoluteTiming[testvecs = processInput[testImages, vv, dn, dst];]
guessed = guesses[nf, testvecs, 4];
AbsoluteTiming[corr = correct[guessed, testLabels]]
N[corr/tlen]

(* Out[452]= {2.221543, Null}

Out[453]= {0.114258, Null}

Out[454]= {1.296956, Null}

Out[456]= 0.942231075697 *)

So 94.2% correct, which is not bad. If we bring the number of retained singular values way up, we can hit 95% correct.
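
For example, one could simply raise keep and rerun the same pipeline; the value below is only an illustrative guess, not the exact setting behind the 95% figure:

keep = 120;  (* illustrative larger value *)
{nf, vv} = nearestImages[trainImages, trainLabels, dn, dst, keep];
testvecs = processInput[testImages, vv, dn, dst];
guessed = guesses[nf, testvecs, 4];
N[correct[guessed, testLabels]/tlen]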

POSTED BY: Daniel Lichtblau

That's not fair! You used your brain and I used someone else's without using mine!

Your approach is beautiful and makes a lot of sense. It is also much faster than the ML approach. On the other hand, you needed to understand what you were doing, and I could just rely on the Wolfram Language's built-in intelligence. So on a brain-usage scale you win, on a laziness scale I win...

Your method is so fast that we should be able to run a parameter sweep and try to find "optimal" settings.
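
Something along these lines could serve as a sweep; the grid below is only my guess at sensible values (with dst kept fixed at 4 as above), not a tuned setup:

(* sketch of a sweep over the tuning parameters; the grid is illustrative *)
sweep = Flatten[
   Table[
    Module[{nf, vv, testvecs, guessed},
     {nf, vv} = nearestImages[trainImages, trainLabels, dn, dst, keep];
     testvecs = processInput[testImages, vv, dn, dst];
     guessed = guesses[nf, testvecs, nNbrs];
     {keep, dn, nNbrs, N[correct[guessed, testLabels]/tlen]}],
    {keep, {20, 40, 60}}, {dn, {16, 20, 24}}, {nNbrs, {3, 4, 5}}],
   2];
TakeLargestBy[sweep, Last, 5]  (* the five best parameter combinations *)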

BTW, I would expect your method to run very well on the MNIST dataset as well. The Japanese character set has all of the additional problems that Sean describes.

Thank you very much for that idea! I think that it is very interesting to see how "by-hand" methods can be equally or more powerful (and faster).

Cheers,

Marco

POSTED BY: Marco Thiel

With what seems to be the best tuning I could manage, it gets 98% on MNIST. In contrast, the best methods, which I believe are NN-based, hit around 99.7% correct if memory serves. There is a related set from the US Postal Service that is somewhat more challenging, with the best methods "only" getting around 98%, I think.

I confess I may have borrowed a brain for that particular bit of work.

POSTED BY: Daniel Lichtblau

Wonderful work, Marco, thank you for sharing! Have you noticed by any chance what method Classify picked automatically? It can be found in the Classify icon:

[Image: information panel opened from the Classify icon]

POSTED BY: Vitaliy Kaurov

Dear Vitaliy,

Yes, indeed. It uses the "NearestNeighbors" method. It is quite astonishing that it does so well and is so fast.
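
The chosen method can also be read off programmatically. A minimal sketch, assuming the trained classifier is stored in a symbol such as c (that symbol name is my own placeholder, not from the post above):

ClassifierInformation[c, "Method"]
(* in more recent versions: Information[c, "Method"] *)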

I think I will try to explore Sean's suggestion a bit more. I have been speaking with a Chinese colleague (a very experienced calligrapher) and will try to speak to a Japanese colleague tomorrow to see whether there is a way of creating other, similar datasets. I have got lots of pages, most of which are in Chinese. Luckily Google/Mathematica can translate that for me.

It appears that in Chinese there are different calligraphy "schools" or factions. I think that it would be interesting if we could distinguish automatically between calligraphic glyphs of different schools.

There also seems to be a substantial "evolution" of symbols. It would be cool to follow individual symbols over time and see how they morph into one another.

All the best from Aberdeen,

Marco

POSTED BY: Marco Thiel

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

POSTED BY: EDITORIAL BOARD

It'll take me a while to download the dataset and check, but it looks like there's a lot of hentaigana. I can only recognize some of them. https://en.wikipedia.org/wiki/Hentaigana

To summarize, in older Japanese there are a lot of possible variant characters that can be used to represent the same sound. So your learning task is made much harder because of these. If you could sort them out and learn the variant characters, you'd probably get even better results.

POSTED BY: Sean Clarke

Thank you very much for your suggestion. The fact that I do not speak Japanese becomes quite problematic here. I did wonder why there was such a variety of characters for the "same" symbol, i.e. the last table in my post, first row. They look very different, but I have no feeling for how different they are supposed to look.

I do agree that introducing further categories/classes might help. The thing is that the dataset came with this annotation, and I have no idea how to do that sorting manually.

I believe that FeatureSpacePlot might help. For example, for the first character in the set (with the same variables as in my OP) we get:

FeatureSpacePlot[Select[trainingset, #[[2]] == 0 &][[1 ;; 200, 1]], ImageSize -> Full]

[Image: FeatureSpacePlot of 200 examples of the first character]

The plot suggests that there are different sub-classes describing the same symbol. A dendrogram might be useful to find such subclasses, too.

Dendrogram[Select[trainingset, #[[2]] == 0 &][[1 ;; 50, 1]], ClusterDissimilarityFunction -> (Max[#] &)]

[Image: Dendrogram of 50 examples of the first character]
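
One thing I might try (just a sketch, I have not run this here) is to let FindClusters propose sub-classes for one character, which could then serve as finer labels:

(* sketch: ask FindClusters to propose sub-classes for the first character *)
imgs = Select[trainingset, #[[2]] == 0 &][[1 ;; 200, 1]];
subclasses = FindClusters[imgs];
Length[subclasses]  (* number of proposed sub-classes *)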

The thing is that I would need to know much more about Japanese to do anything useful here. I do have a colleague who speaks Japanese; I'll try to get some help.

Thank you very much for your comment.

Cheers,

Marco

POSTED BY: Marco Thiel