WOLFRAM COMMUNITY

GROUPS:

Data Science

Why does Classify only guess according to number of items of each type?

Daniel Janzon

Posted 10 years ago

I'm new to Mathematica and wanted to try out the Classify function by classifying Swedish male vs. female names. It turns out it doesn't even look at the training data; the classification is performed solely on the basis of the number of training examples of each type. My code:

femaleNames = {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna", "Karin", "Marie", "Sara", "Viola"};
maleNames = {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin", "Knut", "Sune", "Tim", "Örjan"};
trainingData = Join[Table[name -> True, {name, femaleNames}], Table[name -> False, {name, maleNames}]];
c = Classify[trainingData];
c["Daniel", "Probabilities"]

<|False -> 0.5, True -> 0.5|>

The last line shows that the probabilities are 0.5, since I have the same number of names in each class. If I change the size of, e.g., the femaleNames set, the probabilities change accordingly. I tried the various classification methods and they all fail in the same way. What am I missing?

POSTED BY: Daniel Janzon

3 Replies

Etienne Bernard, NuMind

Posted 10 years ago

In that case, Classify interprets the input as text and uses words as features. Since there is only one word per example, it can't generalise.
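This is consistent with the 0.5/0.5 result in the question: when no feature carries over from the training names to a new name, the class frequencies in the training set are about all that is left to go on. A minimal sketch of that baseline, assuming the 10 True and 10 False examples from the question (the name `labels` is only for illustration, and this is not a description of what Classify does internally):

```wolfram
(* Labels as in the question: 10 names tagged True, 10 tagged False *)
labels = Join[ConstantArray[True, 10], ConstantArray[False, 10]];

(* Fraction of training examples in each class *)
N[Counts[labels]/Length[labels]]
(* <|True -> 0.5, False -> 0.5|> *)
```

Changing the number of entries in either class shifts these fractions, which matches the behaviour described in the question.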

What you can do is extract features from the names. For example, you can construct a function that extracts the last letter and the last two letters:

features[name_] := {StringTake[name, -1], StringTake[name, -2]}
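As a quick check of what this function returns, here it is applied to two of the names from the question. Negative arguments to StringTake count from the end of the string, so a two-letter name such as "Bo" is returned whole as its own two-letter suffix:

```wolfram
(* Same definition as above, repeated so the snippet runs on its own *)
features[name_] := {StringTake[name, -1], StringTake[name, -2]}

features["Amanda"]   (* {"a", "da"} *)
features["Bo"]       (* {"o", "Bo"} *)
```

Note that StringTake[name, -2] requires at least two characters, so this sketch assumes there are no single-letter names.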

Then construct a training set:

femaleNames = 
  features /@ {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna", 
    "Karin", "Marie", "Sara", "Viola"};
maleNames = 
  features /@ {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin", 
    "Knut", "Sune", "Tim", "Örjan"};
trainingData = 
 Join[Thread[femaleNames -> "Female"], Thread[maleNames -> "Male"]]

{{"a", "da"} -> "Female", {"a", "na"} -> "Female", {"a", "ma"} -> 
  "Female", {"a", "ka"} -> "Female", {"n", "in"} -> 
  "Female", {"a", "na"} -> "Female", {"n", "in"} -> 
  "Female", {"e", "ie"} -> "Female", {"a", "ra"} -> 
  "Female", {"a", "la"} -> "Female", {"s", "rs"} -> 
  "Male", {"o", "Bo"} -> "Male", {"k", "ik"} -> "Male", {"n", "an"} ->
   "Male", {"b", "ob"} -> "Male", {"n", "in"} -> 
  "Male", {"t", "ut"} -> "Male", {"e", "ne"} -> "Male", {"m", "im"} ->
   "Male", {"n", "an"} -> "Male"}

You can then train the classifier and test it:

In[87]:= c = Classify[trainingData];
test = {"Daniel", "Karina", "Sofie", "Josefina", "Sissela", "Sven", 
   "Erik"};
Thread[test -> c[features /@ test]]

Out[89]= {"Daniel" -> "Male", "Karina" -> "Female", 
 "Sofie" -> "Female", "Josefina" -> "Female", "Sissela" -> "Female", 
 "Sven" -> "Male", "Erik" -> "Male"}

It got all the Swedish names I know right, but to make it better, you probably need more data and better features.
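One way "better features" might look is sketched below. The names `vowels`, `richFeatures`, `female`, `male`, `richTraining` and `c2` are hypothetical, and the particular features (lower-cased suffixes, name length, a vowel-ending flag) are only assumptions about what could help, not something tested on real data. The "Probabilities" property from the original question is then queried on the feature vector instead of the raw name:

```wolfram
(* Hypothetical richer feature extractor; assumes names have at least two characters *)
vowels = Characters["aeiouyåäö"];
richFeatures[name_String] := With[{s = ToLowerCase[name]},
  {
   StringTake[s, -1],                   (* last letter *)
   StringTake[s, -2],                   (* last two letters *)
   StringLength[s],                     (* name length *)
   MemberQ[vowels, StringTake[s, -1]]   (* True if the name ends in a vowel *)
  }]

(* Same names as in the thread *)
female = {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna",
   "Karin", "Marie", "Sara", "Viola"};
male = {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin",
   "Knut", "Sune", "Tim", "Örjan"};

richTraining = Join[
   Thread[(richFeatures /@ female) -> "Female"],
   Thread[(richFeatures /@ male) -> "Male"]];

c2 = Classify[richTraining];

(* Class probabilities for a new name, computed from its features *)
c2[richFeatures["Daniel"], "Probabilities"]
```

With only 20 training names the probabilities will still be rough; the larger gain probably comes from more data, as noted above.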

POSTED BY: Etienne Bernard