GROUPS: Staff Picks | Data Science | Image Processing | Algebra | Graphics and Visualization | High-Performance Computing | Wolfram Language | Machine Learning | Natural Language Processing | Computational Humanities | Wolfram Function Repository
Re-exploring the structure of Chinese character images
by Anton Antonov, Accendo Data LLC
MathematicaForPrediction at WordPress | MathematicaForPrediction at GitHub
April 2022 | Version 0.8
Posted 4 months ago | 2760 Views | 12 Replies | 25 Total Likes
Introduction
In this notebook we show information retrieval and clustering techniques over images of the Unicode collection of Chinese characters. Here is the outline of the notebook's exposition:
1. Get Chinese character images.
2. Cluster "image vectors" and demonstrate that the obtained clusters have certain explainability elements.
3. Apply the Latent Semantic Analysis (LSA) workflow to the character set.
4. Show a visual thesaurus through a recommender system. (That uses Cosine similarity.)
5. Discuss graph and hierarchical clustering using LSA matrix factors.
6. Demonstrate approximation of "unseen" character images with an image basis obtained through LSA over a small set of (simple) images.
7. Redo character approximation with a more "interpretable" image basis.
Remark: This notebook started as an (extended) comment for the Community discussion "Exploring structure of Chinese characters through image processing", [SH1]. (Hence the title.)
Get Chinese character images
This code is a copy of the code in the original Community post by Silvia Hao, [SH1]:
In[]:= ClearAll[pipe, branch, branchSeq]
pipe = RightComposition;
branch = Through@*{##} &;
branchSeq = pipe[branch[##], Apply[Sequence]] &;

ClearAll[λ, ppImg]
(* assume I use my first monitor: *)
λ = 1/First[Lookup["Scale"][SystemInformation["Devices", "ConnectedDisplays"]]];
ppImg[magF_ : 1] := Function[
   Module[{nbMgfy = AbsoluteCurrentValue[EvaluationNotebook[], Magnification],
     $width = Part[ImageDimensions@#, 1]},
    Image[#, ImageSize -> magF λ $width/nbMgfy]]]

In[]:= Module[{fsize = 50, width = 64, height = 64},
  lsCharIDs = Map[FromCharacterCode[#, "Unicode"] &, 16^^4E00 - 1 + Range[width height]];
  ]

In[]:= charPage = Module[{fsize = 50, width = 64, height = 64},
    16^^4E00 - 1 + Range[width height] // pipe[
       FromCharacterCode[#, "Unicode"] &,
       Characters,
       Partition[#, width] &,
       Grid[#, Background -> Black, Spacings -> {0, 0}, ItemSize -> {1.5, 1.2},
         Alignment -> {Center, Center}, Frame -> All,
         FrameStyle -> Directive[Red, AbsoluteThickness[3 λ]]] &,
       Style[#, White, fsize, FontFamily -> "Source Han Sans CN", FontWeight -> "ExtraLight"] &,
       Rasterize[#, Background -> Black] &]
    ];
chargrid = charPage // ColorDistance[#, Red] & // Image[#, "Byte"] & // Sign // Erosion[#, 5] &;
lmat = chargrid // MorphologicalComponents[#, Method -> "BoundingBox", CornerNeighbors -> False] &;
chars = ComponentMeasurements[
     {charPage // ColorConvert[#, "Grayscale"] &, lmat},
     "MaskedImage", #Width > 10 &] // Values // Map@RemoveAlphaChannel;
chars = Module[{size = chars // Map@ImageDimensions // Max},
   ImageCrop[#, {size, size}] & /@ chars];
Here is a sample of the obtained images:
SeedRandom[33];
RandomSample[chars, 5]

Out[]= (a row of five randomly sampled character images)
Vector representation of images
Define a function that maps an image into a linear vector space (of pixels):
In[]:= Clear[ImageToVector];
ImageToVector[img_Image] := Flatten[ImageData[ColorConvert[img, "Grayscale"]]];
ImageToVector[img_Image, imgSize_] := Flatten[ImageData[ColorConvert[ImageResize[img, imgSize], "Grayscale"]]];
ImageToVector[___] := $Failed;
Show what the vector representations of the images look like:
Table[BlockRandom[
  img = RandomChoice[chars];
  ListPlot[ImageToVector[img], Filling -> Axis, PlotRange -> All,
   PlotLabel -> img, ImageSize -> Medium, AspectRatio -> 1/6],
  RandomSeeding -> rs], {rs, {33, 998}}]

Out[]= (two pixel-value list plots, each labeled with its character image)
Data preparation
In this section we represent the images in a linear vector space. (In which each pixel is a basis vector.)
Make an association with images:
aCImages = AssociationThread[lsCharIDs -> chars];
Length[aCImages]

Out[]= 4096
Make flat vectors from the images:
AbsoluteTiming[
 aCImageVecs = ParallelMap[ImageToVector, aCImages];
 ]

Out[]= {0.998162, Null}
Make matrix plots of a random sample of the image vectors:
SeedRandom[32];
MatrixPlot[Partition[#, ImageDimensions[aCImages〚1〛]〚2〛]] & /@ RandomSample[aCImageVecs, 6]

Out[]= (an association of matrix plots with keys 垖, 埄, 媭, 垝, 偭, 効)
Clustering over the image vectors
In this section we cluster "image vectors" and demonstrate that the obtained clusters have certain explainability elements. Expected Chinese character radicals are observed using image multiplication.
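Remark: To see why image multiplication reveals radicals: the characters are rendered white-on-black, so multiplying images keeps only the pixels that are bright in all of them, i.e. the strokes the images share. Here is a minimal sketch (the two characters are an illustrative choice of mine, both containing the "mouth" radical):

(* Multiply two characters that share the "mouth" radical; only the shared
   strokes survive, since non-common strokes multiply with black (0) pixels. *)
ImageMultiply[aCImages["叫"], aCImages["吃"]]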
Cluster the image vectors and show a summary of the cluster lengths:
SparseArray[Values@aCImageVecs]

Out[]= SparseArray (Specified elements: 4658698; Dimensions: {4096, 2500})
e
d
R
a
n
d
o
m
[
3
3
4
]
;
A
b
s
o
l
u
t
e
T
i
m
i
n
g
[
l
s
C
l
u
s
t
e
r
s
=
F
i
n
d
C
l
u
s
t
e
r
s
[
S
p
a
r
s
e
A
r
r
a
y
[
V
a
l
u
e
s
@
a
C
I
m
a
g
e
V
e
c
s
]
-
>
K
e
y
s
[
a
C
I
m
a
g
e
V
e
c
s
]
,
3
5
,
M
e
t
h
o
d
-
>
{
"
K
M
e
a
n
s
"
}
]
;
]
L
e
n
g
t
h
@
l
s
C
l
u
s
t
e
r
s
R
e
s
o
u
r
c
e
F
u
n
c
t
i
o
n
[
"
R
e
c
o
r
d
s
S
u
m
m
a
r
y
"
]
[
L
e
n
g
t
h
/
@
l
s
C
l
u
s
t
e
r
s
]
O
u
t
[
]
=
{
2
4
.
6
3
8
3
,
N
u
l
l
}
O
u
t
[
]
=
3
5
O
u
t
[
]
=
1
c
o
l
u
m
n
1
M
i
n
1
7
1
s
t
Q
u
7
9
.
7
5
M
e
a
n
1
1
7
.
0
2
9
M
e
d
i
a
n
1
1
8
3
r
d
Q
u
1
4
6
.
7
5
M
a
x
2
2
5
For each cluster:
◼ Take 30 different small samples of 7 images
◼ Multiply the images in each small sample
◼ Show the three "most black" multiplication results
SeedRandom[33];
Table[i -> TakeLargestBy[
    Table[ImageMultiply @@ RandomSample[KeyTake[aCImages, lsClusters〚i〛], UpTo[7]], 30],
    Total@ImageToVector[#] &, 3], {i, Length[lsClusters]}]

Out[]= (for each of the 35 clusters, the three darkest image products)
Remark: We can see that the clustering above produced "semantic" clusters -- most of the multiplied images show meaningful Chinese character radicals and their "expected positions."
Here is one of the clusters with the radical "mouth":
KeyTake[aCImages, lsClusters〚26〛]

Out[]= (images of the characters: 卟, 収, 叨, 叫, 叮, 叱, 叶, 叹, 叺, 叻, 叼, 叿, 吀, 吁, 吃, 吅, 吆, 吇, 吋, 吐, 吒, 吓, 吖, 吗, 吘, 吜, 吟, 吥, 吧, 听, 吭, 吽, 呀, 呌, 呍, 呓, 呕, 呛, 呜, 呞, 呟, 呧, 呪, 呫, 呮, 呯, 呵, 呷, 呸, 呺, 呾, 咀, 咊, 咋, 咍, 咓, 咔, 咟, 咡, 咥, 咶, 咺, 哏, 哙, 哣, 哩, 哹, 哻, 唂, 唈, 唔, 唱, 啀, 喣, 屔)
LSAMon application
In this section we apply the "standard" LSA workflow, [AA1, AA4].
Make a matrix with named rows and columns from the image vectors:
mat = ToSSparseMatrix[SparseArray[Values@aCImageVecs], "RowNames" -> Keys[aCImageVecs], "ColumnNames" -> Automatic]

Out[]= SparseArray (Specified elements: 4658698; Dimensions: {4096, 2500})
The following Latent Semantic Analysis (LSA) monadic pipeline is used in [AA1, AA2]:
SeedRandom[77];
AbsoluteTiming[
 lsaAllObj = LSAMonUnit[]⟹
    LSAMonSetDocumentTermMatrix[mat]⟹
    LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]⟹
    LSAMonExtractTopics["NumberOfTopics" -> 60, Method -> "SVD", "MaxSteps" -> 15, "MinNumberOfDocumentsPerTerm" -> 0]⟹
    LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹
    LSAMonEcho[Style["Obtained basis:", Bold, Purple]]⟹
    LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aCImages〚1〛]〚1〛]]] & /@ SparseArray[#H] &];
 ]

» Obtained basis:
» (a grid of the 60 topic-basis images)

Out[]= {7.60828, Null}
Remark: LSAMon's corresponding theory and design are discussed in [AA1, AA4].
Get the representation matrix:
W2 = lsaAllObj⟹LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹LSAMonTakeW

Out[]= SparseArray (Specified elements: 245760; Dimensions: {4096, 60})
Get the topics matrix:
H = lsaAllObj⟹LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹LSAMonTakeH

Out[]= SparseArray (Specified elements: 138002; Dimensions: {60, 2500})
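As a quick sanity check (a sketch of mine, not part of the original workflow): the product of the two factors should have the dimensions of the weighted character-by-pixel matrix, 4096 × 2500. (The reconstruction itself is only approximate, since just 60 topics are kept.)

(* W2: characters × topics (4096 × 60); H: topics × pixels (60 × 2500). *)
Dimensions[SparseArray[W2] . SparseArray[H]]

(* Expected: {4096, 2500} *)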
Cluster the reduced dimension representations and show a summary of the cluster lengths:
AbsoluteTiming[
 lsClusters = FindClusters[Normal[SparseArray[W2]] -> RowNames[W2], 40, Method -> {"KMeans"}];
 ]
Length@lsClusters
ResourceFunction["RecordsSummary"][Length /@ lsClusters]

Out[]= {2.33331, Null}

Out[]= 40

Out[]= (cluster-length summary: Min 26, 1st Qu 59, Median 96.5, Mean 102.4, 3rd Qu 142, Max 200)
Show cluster interpretations:
AbsoluteTiming[
 aAutoRadicals = Association@Table[i -> TakeLargestBy[
       Table[ImageMultiply @@ RandomSample[KeyTake[aCImages, lsClusters〚i〛], UpTo[8]], 30],
       Total@ImageToVector[#] &, 3], {i, Length[lsClusters]}];
 ]
aAutoRadicals

Out[]= {0.878406, Null}

Out[]= (for each of the 40 clusters, the three darkest image products)
Using FeatureExtraction
I experimented with clustering and approximation using WL's function FeatureExtraction. The results are fairly similar to the ones above; the timings are different (a few times slower).
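For reference, here is a minimal sketch of that experiment; the number of clusters (40) mirrors the K-means run above and is an assumption, not the exact setting used:

(* Learn a feature extractor from the character images and cluster
   the extracted feature vectors, analogous to the pixel-vector clustering above. *)
fe = FeatureExtraction[Values[aCImages]];
lsFEClusters = FindClusters[fe[Values[aCImages]] -> Keys[aCImages], 40, Method -> {"KMeans"}];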
Visual thesaurus
In this section we use Cosine similarity to find visual nearest neighbors of Chinese character images.
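As a reminder (a minimal sketch), the Cosine similarity of two image vectors is their dot product divided by the product of their norms; the two characters below are an illustrative choice:

(* Cosine similarity of two pixel vectors; 1 means identical direction. *)
cosineSimilarity[u_, v_] := u . v/(Norm[u] Norm[v]);
cosineSimilarity[aCImageVecs["團"], aCImageVecs["圓"]]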
In[]:= matPixels = WeightTermsOfSSparseMatrix[lsaAllObj⟹LSAMonTakeWeightedDocumentTermMatrix, "IDF", "None", "Cosine"];
matTopics = WeightTermsOfSSparseMatrix[lsaAllObj⟹LSAMonNormalizeMatrixProduct[Normalized -> Left]⟹LSAMonTakeW, "None", "None", "Cosine"];

In[]:= smrObj = SMRMonUnit[]⟹SMRMonCreate[<|"Topic" -> matTopics, "Pixel" -> matPixels|>];
Consider the character "團":
aCImages["團"]

Out[]= (the image of the character 團)
Here are the nearest neighbors for that character found by using both image topics and image pixels:
(*focusItem = RandomChoice[Keys@aCImages];*)
focusItem = {"團", "仼", "呔"}〚1〛;
smrObj⟹
  SMRMonEcho[Style["Nearest neighbors by pixel topics:", Bold, Purple]]⟹
  SMRMonSetTagTypeWeights[<|"Topic" -> 1, "Pixel" -> 0|>]⟹
  SMRMonRecommend[focusItem, 8, "RemoveHistory" -> False]⟹
  SMRMonEchoValue⟹
  SMRMonEchoFunctionValue[AssociationThread[Values@KeyTake[aCImages, Keys[#]], Values[#]] &]⟹
  SMRMonEcho[Style["Nearest neighbors by pixels:", Bold, Purple]]⟹
  SMRMonSetTagTypeWeights[<|"Topic" -> 0, "Pixel" -> 1|>]⟹
  SMRMonRecommend[focusItem, 8, "RemoveHistory" -> False]⟹
  SMRMonEchoFunctionValue[AssociationThread[Values@KeyTake[aCImages, Keys[#]], Values[#]] &];

» Nearest neighbors by pixel topics:
» value: <|團 -> 1., 圑 -> 0.831716, 圕 -> 0.826445, 圊 -> 0.75788, 圈 -> 0.73557, 圏 -> 0.718088, 圚 -> 0.691223, 圓 -> 0.670589|>
» (the corresponding character images with the same scores)
» Nearest neighbors by pixels:
» (character images with scores 1., 0.914661, 0.904474, 0.890879, 0.889363, 0.881746, 0.881071, 0.879139)
Remark: Of course, in the recommender pipeline above we can use both pixels and pixel topics. (With their contributions being weighted.)
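Here is a minimal sketch of such a blended recommendation; the equal 0.5 weights are an illustrative choice, not a tuned setting:

smrObj⟹
  SMRMonSetTagTypeWeights[<|"Topic" -> 0.5, "Pixel" -> 0.5|>]⟹
  SMRMonRecommend[focusItem, 8, "RemoveHistory" -> False]⟹
  SMRMonEchoValue;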
Graph clustering
In this section we demonstrate the use of graph communities to find similar groups of Chinese characters.
Here we take a sub-matrix of the reduced dimension matrix computed above:
In[]:= W = lsaAllObj⟹LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹LSAMonTakeW;
Here we find the similarity matrix between the characters and remove entries corresponding to "small" similarities:
In[]:= matSym = Clip[W . Transpose[W], {0.78, 1}, {0, 1}];
Here we plot the obtained (clipped) similarity matrix:
MatrixPlot[matSym]

Out[]= (matrix plot of the clipped similarity matrix)
Here we:
◼ Take the array rules of the sparse similarity matrix
◼ Drop the rules corresponding to the diagonal elements
◼ Convert the keys of the rules into undirected graph edges
◼ Make the corresponding graph
◼ Find the graph's connected components
◼ Show the number of connected components
◼ Show a tally of the number of nodes in the components
gr = Graph[UndirectedEdge @@@ DeleteCases[Union[Sort /@ Keys[SSparseMatrixAssociation[matSym]]], {x_, x_}]];
lsComps = ConnectedComponents[gr];
Length[lsComps]
ReverseSortBy[Tally[Length /@ lsComps], First]

Out[]= 138

Out[]= {{1839, 1}, {31, 1}, {27, 1}, {16, 1}, {11, 2}, {9, 2}, {8, 1}, {7, 1}, {6, 5}, {5, 3}, {4, 8}, {3, 14}, {2, 98}}
Here we demonstrate that the clusters of Chinese characters make sense:
aPrettyRules = Dispatch[Map[# -> Style[#, FontSize -> 36] &, Keys[aCImages]]];
CommunityGraphPlot[Subgraph[gr, TakeLargestBy[lsComps, Length, 10]〚2〛],
  Method -> "SpringElectrical", VertexLabels -> Placed["Name", Above],
  AspectRatio -> 1, ImageSize -> 1000] /. aPrettyRules

Out[]= (community graph plot of the second-largest connected component)
Remark: By careful observation of the clusters and graph connections we can convince ourselves that the similarities are based on pictorial sub-elements (i.e. radicals) of the characters.
Hierarchical clustering
In this section we apply hierarchical clustering to the reduced dimension representation of the Chinese character images.
Here we pick a cluster:
lsFocusIDs = lsClusters〚12〛;
Magnify[ImageCollage[Values[KeyTake[aCImages, lsFocusIDs]]], 0.4]
Here is how we can make a dendrogram plot (not that useful here):
In[]:= (*smat = W2〚lsClusters〚13〛, All〛;
Dendrogram[Thread[Normal[SparseArray[smat]] -> Map[Style[#, FontSize -> 16] &, RowNames[smat]]], Top, DistanceFunction -> EuclideanDistance]*)
Here is a heat-map plot with hierarchical clustering dendrogram (with tool-tips):
gr = HeatmapPlot[W2〚lsFocusIDs, All〛, DistanceFunction -> {CosineDistance, None}, Dendrogram -> {True, False}];
gr /. Map[# -> Tooltip[Style[#, FontSize -> 16], Style[#, Bold, FontSize -> 36]] &, lsFocusIDs]

Out[]= (heat-map plot with a row dendrogram; the row labels have tooltips)
Remark: The plot above has tooltips with larger character images.
Representing all characters with smaller set of basic ones
In this section we demonstrate that a relatively small set of simpler Chinese character images can be used to represent (or approximate) the rest of the images.
Remark: We use the following heuristic: the simpler Chinese characters have the smallest amount of white pixels.
Obtain a training set of images -- that are the darkest -- and show a sample of that set:
{trainingInds, testingInds} = TakeDrop[Keys[SortBy[aCImages, Total[ImageToVector[#]] &]], 800];
SeedRandom[3];
RandomSample[KeyTake[aCImages, trainingInds], 12]

Out[]= (images of the characters: 冶, 孒, 亍, 刓, 伡, 呂, 古, 呔, 岳, 冎, 众, 宎)
Show all training characters with an image collage:
Magnify[ImageCollage[Values[KeyTake[aCImages, trainingInds]], Background -> Gray, ImagePadding -> 1], 0.4]
Apply LSA monadic pipeline with the training characters only:
SeedRandom[77];
AbsoluteTiming[
 lsaPartialObj = LSAMonUnit[]⟹
    LSAMonSetDocumentTermMatrix[SparseArray[Values@KeyTake[aCImageVecs, trainingInds]]]⟹
    LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]⟹
    LSAMonExtractTopics["NumberOfTopics" -> 80, Method -> "SVD", "MaxSteps" -> 120, "MinNumberOfDocumentsPerTerm" -> 0]