Message Boards Message Boards

Classifying Japanese characters from the Edo period

GROUPS:

Introduction

I recently came across a post for a computer program that in some fields intends to compete with the Wolfram Language and which has a toolbox for Machine Learning. I wanted to compare the work described with the workflow in Mathematica. The challenge is to classify old Japanese Characters from texts from the so-called Edo period:

Style[StringTake[WikipediaData["Edo period"], 601], 16]

enter image description here

The characters are written in running style and apparently are difficult to read for Japanese speakers today. We will use two approaches of machine learning in the Wolfram Language to tackle this problem.

Historical Background

Before we proceed, let's have a look at the Tokugawa shogunate. I will use code borrow from @Vitaliy Kaurov from his great article on "Computational history: countries that are gone". The following lines compute the historic boundaries of the shogunate:

Monitor[tokugawaPOLY = 
   Table[EntityValue[Entity["HistoricalCountry", "TokugawaShogunate"],
      EntityProperty["HistoricalCountry", 
      "Polygon", {"Date" -> DateObject[{t}]}]], {t, 1600, 1870}];, t]

We then only filter for "changes" in the boundaries:

tokugawaPOLY // Length
tokugawaPOLYcomp = 
  DeleteMissing[
   DeleteDuplicates[Transpose[{Range[1600, 1870], tokugawaPOLY}], 
    Last[#1] == Last[#2] &], 1, 2];
tokugawaPOLYcomp // Length

In the 271 I consider in the first place there are only 4 changes to the boarders, which we can plot like so:

GeoGraphics[{EdgeForm[Red], GeoStyling[Opacity[.1]], #} & /@ 
  tokugawaPOLYcomp[[All, 2]], GeoProjection -> "Mercator", 
 ImageSize -> 800, GeoBackground -> GeoStyling["StreetMap"], 
 GeoRange -> Entity["Country", "Japan"], 
 GeoRangePadding -> Quantity[800, "Kilometers"], GeoZoomLevel -> 6]

enter image description here

So the boarders were indeed very stable for a long period reflecting the "no more wars" philosophy of the day.

Arts and culture

The stability led to a very rich art and literature scene. Here are some images that reflect the style of the time:

ImageCollage["Thumbnail" /. Normal[Normal[WebImageSearch["Edo Period"]]], Method -> "Rows"]

enter image description here

Our dataset

Also literature flourished during this period. Our dataset will be taken from the Center for Open Data in the Humanities. The keen reader might have noticed that the page is in Japanese. Mathematica can translate the page, and so can Google. The dataset is called the "Japanese classic registered glyph data set"; 日本古典籍字形データセット.

TextTranslation["日本古典籍字形データセット"]

"Japan classics-type datasets". Following the first link (grey box) on the original Japanese page one gets to this page, from the bottom of which (look for a download symbol and the text "(TAR+GZ 24.57 MB)" the dataset can be downloaded. This is the google translation of the reference:

Japanese Classic Statement Character Data Set" (Kokugaku Kenkyu Other Collection / CODH Processing)

which in the original is "日本古典籍字形データセット』(国文研ほか所蔵/CODH加工".

The glyphs come from "bamboo letters" from 15 classic texts; the dataset contains more than 20000 glyphs of the 10 most frequent symbols. These letters were annotated by the "National Institute of Letters of Japan" (google). The dataset was also used in a data/machine learning challenge.

After extracting the file to a folder on the desktop I can import the images with:

characterImages = Import["/Users/thiel/Desktop/pmjt_sample_20161116/train_test_file_list.h5", "Data"];

Here is a sample of the glyphs:

Grid[Partition[ImageAdjust[Image[#/256.]] & /@ characterImages[[3, 1 ;; 24]], 6], Frame -> All]

enter image description here

Similar to the MNIST dataset, they consist of 28 times 28 pixels. They give the grey level as integers, which is why I divide by 256. This shows the pixelation of the glyphs:

Image[characterImages[[3, 12]]/256.] // ImageAdjust

enter image description here

The dataset conveniently contains a training and a test set. There are

characterImages[[3]] // Length

19909 glyphs in the training set. All of these are annotated:

Flatten[characterImages[[4]]]

enter image description here

We can convert the training data to images like so:

Monitor[trainingset = Table[ImageAdjust[Image[characterImages[[3, k]]/256.]] -> Flatten[characterImages[[4]]][[k]], {k, 1, 19909}];, k]

The test set contains

characterImages[[1]] // Length

3514 glyphs. The annotation of which are found in

characterImages[[2]] // Flatten // Short

enter image description here

This is how we prepare the test set data for our machine learning:

Monitor[testset = Table[ImageAdjust[Image[characterImages[[1, k]]/256.]] -> Flatten[characterImages[[2]]][[k]], {k, 1, 3514}];, k]

What are the 10 most frequent symbols?

In one of the annotation files we find the character codes of the symbols so that we can map the classification to actual symbols. These are the glyphs we will consider:

Rasterize /@ (Style[
     FromCharacterCode[
      ToExpression["16^^" <> StringJoin[Characters[#]]]], 
     100] & /@ {"3057", "306B", "306E", "3066", "308A", "3092", 
    "304B", "304F", "304D", "3082"})

enter image description here

We can also make an image collage of them:

ImageCollage[
 Image /@ (Rasterize /@ (Style[
        FromCharacterCode[
         ToExpression["16^^" <> StringJoin[Characters[#]]]], 
        100] & /@ {"3057", "306B", "306E", "3066", "308A", "3092", 
       "304B", "304F", "304D", "3082"})), Method -> "Rows"]

enter image description here

I though it would be nice if they were in a more calligraphic form. So I tried to used ImageRestyle to try and make them more calligraphic. I think that the attempt was not very successful, but I want to show it anyway.

First I download a couple of calligraphic symbols:

caligraphy = "Thumbnail" /. Normal[Normal[WebImageSearch["japanese calligraphy"]]]

I liked the 5th one, which I will use for the style transfer:

calisymbols = 
 ImageRestyle[#, caligraphy[[5]], 
    PerformanceGoal -> 
     "Quality"] & /@ (Rasterize /@ (Style[
        FromCharacterCode[
         ToExpression["16^^" <> StringJoin[Characters[#]]]], 
        100] & /@ {"3057", "306B", "306E", "3066", "308A", "3092", 
       "304B", "304F", "304D", "3082"}))

enter image description here

We can also remove the background to get an alternative representation:

enter image description here

Machine Learning (Classify)

Our first approach to classify the glyphs will be via Classify. The classification is very fast (only a couple of seconds on my machine) and quite accurate:

standardcl = Classify[trainingset, ValidationSet -> testset, PerformanceGoal -> "Quality"]

We can first calculate the ClassifierMeasurements

cmstandard = ClassifierMeasurements[standardcl, testset]

and the accuracy:

cmstandard["Accuracy"]

With 92.5% this is quite accurate and also much faster than the competitor product. The confusion plot shows that it is a very reasonable accuracy - in particular when considering that native speakers have problems identifying the glyphs.

cmstandard["ConfusionMatrixPlot"]

enter image description here

Here are some glyphs that the Classifier identified as class "0":

cmstandard["Examples" -> {0, 0}][[1 ;; 20]]

enter image description here

They look quite diverse, but so does the original set:

RandomSample[Select[trainingset, #[[2]] == 0 &], 20]

enter image description here

It is quite impressive that the Classify function manages to achieve this level of accuracy. We can, however, do better!

Machine Learning (NetTrain)

We can improve the accuracy by training our own custom made network, which in this case will be the standard net that is often used for the MNIST dataset in Mathematica. It turns out that we do not even need a GPU for the training.

lenet = NetChain[{ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2], 
   ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2], FlattenLayer[], 
   500, Ramp, 10, SoftmaxLayer[]}, 
  "Output" -> NetDecoder[{"Class", Range[0, 9]}], 
  "Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}]]

enter image description here

This we can then train:

lenet = NetTrain[lenet, trainingset, ValidationSet -> testset, MaxTrainingRounds -> 20];

enter image description here

The training is done in about 4:30 minutes, but it appears that 20 training rounds are not necessary. Here are some results when applying the classifier:

imgs = Keys@RandomSample[testset, 60]; Thread[imgs -> lenet[imgs]]

enter image description here

Let's see how well it does:

cm = ClassifierMeasurements[lenet, testset]
cm["ConfusionMatrixPlot"]

enter image description here

This corresponds to an accuracy of

cm["Accuracy"]

95.6%, which is really impressive given the dataset. The following table illustrates the difficulty:

Grid[Table[
Prepend[Select[trainingset, #[[2]] == k &][[1 ;; 10]][[All, 1]], 
ImageResize[calisymbols[[k + 1]], 28]], {k, 0, 9}], Frame -> All, 
Background -> {{Red}, None}]

enter image description here

The first (red) column is derived from the glyph based on the character code - using our "calligraphy-style". The remaining columns show items from the training set, i.e. glyphs that were manually classified to belong to the given group.

Conclusion

I do not speak/read Japanese and have no more than Wikipedia knowledge about the Edo period. I still find it quite amazing that using the Wolfram Language's high level Classify function and one of the pre-implemented networks, we can achieve quite a remarkable classification. I am sure that the network can be improved. I would also like to see a more complete dataset with more than 10 different symbols.

POSTED BY: Marco Thiel
Answer
9 days ago

It'll take me a while to download the dataset and check, but it looks like there's a lot of hentaigana. I can only recognize some of them. https://en.wikipedia.org/wiki/Hentaigana

To summarize, in older Japanese there are a lot of possible variant characters that can be used to represent the same sound. So your learning task is made much harder because of these. If you could sort them out and learn the variant characters, you'd probably get even better results.

POSTED BY: Sean Clarke
Answer
9 days ago

Thank you very much for your suggestion. The issue that I do not speak Japanese becomes quite problematic here. I did wonder why there was such a variety of characters for the "same" symbol, i.e. last table in my post, first row. They look very different, but I have no feeling for how different they are supposed to look.

I do agree that introducing further categories/classes might help. The thing is that the dataset came with this annotation and I have no idea how to do that manually.

I believe that FeatureSpacePlot might help. For example for the first character in the set (with the same variables as in my OP) we get:

FeatureSpacePlot[Select[trainingset, #[[2]] == 0 &][[1 ;; 200, 1]], ImageSize -> Full]

enter image description here

suggests that there are different sub-classes describing the same symbol. A dendrogram might be useful to find different subclasses, too.

Dendrogram[Select[trainingset, #[[2]] == 0 &][[1 ;; 50, 1]], ClusterDissimilarityFunction -> (Max[#] &)]

enter image description here

The thing is that I would need to know much more about Japanese to do anything useful here. I do have a colleague who speaks Japanese; I'll try to get some help.

Thank you very much for your comment.

Cheers,

Marco

POSTED BY: Marco Thiel
Answer
9 days ago

enter image description here - Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

POSTED BY: Moderation Team
Answer
9 days ago

Wonderful work, Marco, thank you for sharing! Have you noticed by any chance what method Classify picked automatically? It can be found in the Classify icon:

enter image description here

POSTED BY: Vitaliy Kaurov
Answer
9 days ago

Dear Vitaliy,

yes, indeed. It uses NearestNeighbours. It is quite astonishing that it does so well and is so fast.

I think I will try to explore Sean's suggestion a bit more. I have been speaking with a Chinese colleague (a very experienced calligrapher) and will try to speak to a Japanese colleague tomorrow to see whether there is a way of creating other, similar datasets. I got lots of pages most of which in Chinese. Luckily Google/Mathematica can translate that for me.

It appears that in Chinese there are different calligraphy "schools" or fractions. I think that it would be interesting if we could distinguish automatically between calligraphic glyphs of different schools.

There also seems to be a substantial "evolution" of symbols. It would be cool to follow individual symbols over time and see how they morph into one another.

All the best from Aberdeen,

Marco

POSTED BY: Marco Thiel
Answer
9 days ago

Here is a method that is a bit like using Nearest on images (in that the internal code for that has some similarities). I crib from code located here. The idea quite abbreviated, is to take a sub-array of low frequency Fourier components (I use a DCT for this), flatten these arrays into vectors, extract a singular value decomposition keeping some number of singular values, and use the result to (1) preprocess and (2) look up the test images. We find several "nearest" ones and assign a score by weighting by inverse of proximity to lookup vector.

nearestImages[ilist_, vals_, dn_, dnum_, keep_] :=
 Module[
  {images = ilist, dcts, top,
   topvecs, uu, ww, vv, udotw, norms},
  dcts = Map[FourierDCT[# - Mean[Flatten[#]], dnum] &, images];
  top = dcts[[All, 1 ;; dn, 1 ;; dn]];
  topvecs = Map[Flatten, top];
  topvecs = Map[# &, topvecs];
  {uu, ww, vv} =
   SingularValueDecomposition[topvecs, keep];
  udotw = uu.ww;
  norms = Map[Sqrt[#.#] &, udotw];
  udotw = udotw/norms;
  {Nearest[udotw -> Transpose[{udotw, vals}]], vv}]

processInput[ilist_, vv_, dn_, dnum_] :=
 Module[
  {images = ilist, dcts, top,
   topvecs, tdotv, norms},
  dcts = Map[FourierDCT[# - Mean[Flatten[#]], dnum] &, images];
  top = dcts[[All, 1 ;; dn, 1 ;; dn]];
  topvecs = Map[Flatten, top];
  topvecs = Map[# &, topvecs];
  tdotv = topvecs.vv;
  norms = Map[Sqrt[#.#] &, tdotv];
  tdotv = tdotv/norms;
  tdotv]

guesses[nf_,tvecs_,n_]:=Module[
{nbrs,probs,probsB,bestvals},
probs=Table[
Module[{res=nf[tvecs[[j]],n],dists},
dists=1/Map[Norm[tvecs[[j]]-#,3/2]&,res[[All,1]]];
Thread[{res[[All,2]],dists/Total[dists]}]],
{j,Length[tvecs]}];
probsB=Map[Normal[GroupBy[#,First]]&,probs]/.(val_->vlist:{{val_,_}..}):>(val->Total[vlist[[All,2]]]);
probs=(Range[0,9]/.probsB)/.Thread[Range[0,9]->0];
bestvals=Map[First[Ordering[#,1,Greater]]&,probs,{1}]-1;
bestvals
]

correct[guess_,actual_]/;
Length[guess]==Length[actual]:=
Count[guess-actual,0]
correct[__]:=$Failed

The example proceeds from the point of having imported the arrays into characterImages, as in the original post. We separate inot training and test image arrays and label sets.

trainImages = characterImages[[3, All]]/256.;
trainLabels = Flatten[characterImages[[4, All]]];
testImages = characterImages[[1, All]]/256.;
testLabels = Flatten[characterImages[[2, All]]];

The method has some tuning parameters. The ones used below are in the general vicinity of what is used in the tests at the link given above. We use four neighbors although some experiments indicate 3 might be a better choice for this particular data set. Total run time is a few seconds.

keep = 40;
dn = 20;
dst = 4;
AbsoluteTiming[{nf, vv} =
   nearestImages[trainImages,
    trainLabels, dn, dst, keep];]
AbsoluteTiming[testvecs =
   processInput[testImages, vv, dn, dst];]
guessed = guesses[nf, testvecs, 4];
AbsoluteTiming[corr = correct[guessed, testLabels]]
N[corr/tlen]

(* Out[452]= {2.221543, Null}

Out[453]= {0.114258, Null}

Out[454]= {1.296956, Null}

Out[456]= 0.942231075697 *)

So 94.2% correct, which is not bad. If we bring the number of retained singular values way up we can hit 95% correct.

POSTED BY: Daniel Lichtblau
Answer
8 days ago

That's not fair! You used your brain and I used someone else's without using mine!

Your approach is beautiful and makes a lot of sense. Also it is much faster than the ML approach. On the other hand, you needed to understand what you are doing, and I could just rely on the Wolfram Language's built-in intelligence. So on a brain-usage scale you win, on a laziness scale I win...

Your method is so fast that we should be able to run a parameter sweep and try to find "optimal" ones.

BTW, I would expect your methods to run very well on the MNIST dataset as well. The Japanese character set has all these problems as Sean describes.

Thank you very much for that idea! I think that it is very interesting to see how "by-hand" methods can be equally/more powerful (and faster).

Cheers,

Marco

POSTED BY: Marco Thiel
Answer
8 days ago

With what seems to be the best tuning I could manage it gets 98% on MNIST. In contrast the best methods, which I believe are NN-based, hit around 99.7% correct if memory serves. There is a related set from the US Postal Service that is somewhat more challenging, with best methods "only" getting around 98% I think.

I confess I may have borrowed a brain for that particular bit of work.

POSTED BY: Daniel Lichtblau
Answer
7 days ago

I should ask: are you (Marco) certain it is fair game to put the test set in as the ValidationSet option setting? It would seem from the documentation that that will cause parameter values to be optimized for the known results, which in turn might lead to a score that is artificially inflated. But maybe I misunderstand what that option really does.

POSTED BY: Daniel Lichtblau
Answer
8 days ago

Dear Daniel,

you are quite right. I wondered about that myself, but saw this in one of the presentations by one of your guys at the WTC. I have tried the same thing without that option and got very similar results. It would be great to get the opinion of one of your developers.

I just ran it again (without the option)

lenet = NetTrain[lenet, trainingset, MaxTrainingRounds -> 20];

and got:

cm = ClassifierMeasurements[lenet, testset]
cm["Accuracy"]

0.958452, so nearly 96%. I would love to get some feedback about the ValidationSet option. Thank you for bringing this up.

Best wishes and thanks,

Marco

PS:From the documentation: "ValidationSet->data is typically used when the data in the training set and the data that one wishes to predict or classify come from different sources." That suggests that I should not use the option; but I do not fully understand that scenario. Luckily, in this case it does not make any difference to the main result.

POSTED BY: Marco Thiel
Answer
8 days ago

Group Abstract Group Abstract