3 Replies

Why does Classify only guess according to number of items of each type?

Posted 9 years ago

I'm new to Mathematica and wanted to try out the Classify function by classifying Swedish names as male or female. It seems it doesn't even look at the training data; the classification appears to be based solely on the number of training examples in each class. My code:

femaleNames = {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna", 
   "Karin", "Marie", "Sara", "Viola"};
maleNames = {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin", 
   "Knut", "Sune", "Tim", "Örjan"};
trainingData = 
  Join[Table[name -> True, {name, femaleNames}], 
   Table[name -> False, {name, maleNames}]];
c = Classify[trainingData];
c["Daniel", "Probabilities"]

<|False -> 0.5, True -> 0.5|>

The last line shows that both probabilities are 0.5, since I have the same number of names in each class. If I change the size of, e.g., the femaleNames set, the probabilities change accordingly. I tried the various classification methods and they all fail in the same way.

What am I missing?

POSTED BY: Daniel Janzon
3 Replies

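In that case, Classify interprets the input as text and uses words as features. Since there is only one word per example, it won't be able to generalise: an unseen name such as "Daniel" shares no word with any training example, so the only information left is how many examples each class has.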

What you can do is extract features from the names. For example, you can construct a function that extracts the last letter and the last two letters:

features[name_] := {StringTake[name, -1], StringTake[name, -2]}
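
As a quick check of what this function returns, here is a minimal sketch that applies it to two of the names from the question. The definition is repeated so the snippet can be evaluated on its own.

```wolfram
(* Same definition as above: last letter and last two letters of a name *)
features[name_] := {StringTake[name, -1], StringTake[name, -2]}

features["Amanda"]   (* {"a", "da"} *)
features["Bo"]       (* {"o", "Bo"} *)
```

Note that StringTake[name, -2] needs at least two characters, so a one-letter name would need special handling.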

Then construct a training set:

femaleNames = 
  features /@ {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna", 
    "Karin", "Marie", "Sara", "Viola"};
maleNames = 
  features /@ {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin", 
    "Knut", "Sune", "Tim", "Örjan"};
trainingData = 
 Join[Thread[femaleNames -> "Female"], Thread[maleNames -> "Male"]]

{{"a", "da"} -> "Female", {"a", "na"} -> "Female", {"a", "ma"} -> 
  "Female", {"a", "ka"} -> "Female", {"n", "in"} -> 
  "Female", {"a", "na"} -> "Female", {"n", "in"} -> 
  "Female", {"e", "ie"} -> "Female", {"a", "ra"} -> 
  "Female", {"a", "la"} -> "Female", {"s", "rs"} -> 
  "Male", {"o", "Bo"} -> "Male", {"k", "ik"} -> "Male", {"n", "an"} ->
   "Male", {"b", "ob"} -> "Male", {"n", "in"} -> 
  "Male", {"t", "ut"} -> "Male", {"e", "ne"} -> "Male", {"m", "im"} ->
   "Male", {"n", "an"} -> "Male"}

You can then train the classifier and test it:

In[87]:= c = Classify[trainingData];
test = {"Daniel", "Karina", "Sofie", "Josefina", "Sissela", "Sven", 
   "Erik"};
Thread[test -> c[features /@ test]]

Out[89]= {"Daniel" -> "Male", "Karina" -> "Female", 
 "Sofie" -> "Female", "Josefina" -> "Female", "Sissela" -> "Female", 
 "Sven" -> "Male", "Erik" -> "Male"}

It got all the Swedish names I know right, but to make it better, you probably need more data and better features.
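
To give an idea of what "better features" could look like, here is a rough sketch. The function richFeatures, the vowel list, and the choice of name length and a vowel-ending flag as extra features are all assumptions for illustration, not something tested in this thread. It also uses ClassifierMeasurements to put a number on the result instead of checking the predictions by eye.

```wolfram
(* Hypothetical richer feature extractor; these feature choices are assumptions *)
vowels = Characters["aeiouyåäö"];
richFeatures[name_String] := {
  StringTake[name, -1],                               (* last letter *)
  StringTake[name, -Min[2, StringLength[name]]],      (* last two letters, or the whole name if shorter *)
  StringLength[name],                                 (* length of the name *)
  MemberQ[vowels, ToLowerCase[StringTake[name, -1]]]  (* True if the name ends in a vowel *)
}

(* Same training names as in the question *)
female = {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna",
   "Karin", "Marie", "Sara", "Viola"};
male = {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin",
   "Knut", "Sune", "Tim", "Örjan"};
training = Join[Thread[(richFeatures /@ female) -> "Female"],
   Thread[(richFeatures /@ male) -> "Male"]];
c2 = Classify[training];

(* Labelled test names; "Erik" is left out because it is in the training set *)
testNames = {"Daniel", "Karina", "Sofie", "Josefina", "Sissela", "Sven"};
testLabels = {"Male", "Female", "Female", "Female", "Female", "Male"};
ClassifierMeasurements[c2,
 Thread[(richFeatures /@ testNames) -> testLabels], "Accuracy"]
```

With only 20 training names and 6 test names the accuracy figure is very noisy, so it says little about whether the extra features actually help. A larger list of names would be needed for that.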

POSTED BY: Etienne Bernard
Posted 9 years ago

Thanks! That was a good explanation.

POSTED BY: Daniel Janzon

That's a very good solution. I imagine this sort of phoneme classification could be applied to other areas as well.

POSTED BY: Jesse Friedman