# Why does Classify only guess according to number of items of each type?

Posted 8 years ago
I'm new to Mathematica and just wanted to try out the Classify function by classifying Swedish male vs. female names. It turns out it doesn't even seem to look at the training data; the classification is performed solely on the basis of the number of training examples of each type. My code:

    femaleNames = {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna", "Karin", "Marie", "Sara", "Viola"};
    maleNames = {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin", "Knut", "Sune", "Tim", "Örjan"};
    trainingData = Join[Table[name -> True, {name, femaleNames}], Table[name -> False, {name, maleNames}]];
    c = Classify[trainingData];
    c["Daniel", "Probabilities"]

    <|False -> 0.5, True -> 0.5|>

The last row shows that the probabilities are both 0.5, since I have the same number of names in each class. If I change the size of e.g. the femaleNames set, the probabilities change accordingly. I tried the various classification methods and they all fail in the same way.

What am I missing?
3 Replies
Posted 8 years ago
In that case, Classify interprets the input as text and uses words as features. Since there is only one word per example, and no name in the training set is repeated, it won't be able to generalise.

What you can do is extract features from the names. For example, you can construct a function that extracts the last letter and the last two letters:

    features[name_] := {StringTake[name, -1], StringTake[name, -2]}

Then construct a training set:

    femaleNames = features /@ {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna", "Karin", "Marie", "Sara", "Viola"};
    maleNames = features /@ {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin", "Knut", "Sune", "Tim", "Örjan"};
    trainingData = Join[Thread[femaleNames -> "Female"], Thread[maleNames -> "Male"]]

    {{"a", "da"} -> "Female", {"a", "na"} -> "Female", {"a", "ma"} -> "Female",
     {"a", "ka"} -> "Female", {"n", "in"} -> "Female", {"a", "na"} -> "Female",
     {"n", "in"} -> "Female", {"e", "ie"} -> "Female", {"a", "ra"} -> "Female",
     {"a", "la"} -> "Female", {"s", "rs"} -> "Male", {"o", "Bo"} -> "Male",
     {"k", "ik"} -> "Male", {"n", "an"} -> "Male", {"b", "ob"} -> "Male",
     {"n", "in"} -> "Male", {"t", "ut"} -> "Male", {"e", "ne"} -> "Male",
     {"m", "im"} -> "Male", {"n", "an"} -> "Male"}

You can then train the classifier and test it:

    In[87]:= c = Classify[trainingData];
    test = {"Daniel", "Karina", "Sofie", "Josefina", "Sissela", "Sven", "Erik"};
    Thread[test -> c[features /@ test]]

    Out[89]= {"Daniel" -> "Male", "Karina" -> "Female", "Sofie" -> "Female",
     "Josefina" -> "Female", "Sissela" -> "Female", "Sven" -> "Male", "Erik" -> "Male"}

It got all the Swedish names I know right, but to make it better, you probably need more data and better features.
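As a rough sketch of how the approach above might be hardened a little, the variant below lower-cases each name before taking suffixes (so that "Bo" yields "bo" rather than "Bo") and clamps the suffix length so a one-letter name does not make StringTake fail. The name `suffixFeatures` and the small held-out set are illustrative assumptions, not part of the original answer; the held-out labels simply reuse three of the test names above. It also asks for "Probabilities", as in the original question, and uses ClassifierMeasurements to get an accuracy figure.

```mathematica
(* Hypothetical variant of features: lower-case first, and clamp the suffix
   length so names shorter than two letters do not break StringTake.
   Assumes every name is a non-empty string. *)
suffixFeatures[name_String] := With[{s = ToLowerCase[name]},
  {StringTake[s, -1], StringTake[s, -Min[2, StringLength[s]]]}]

femaleNames = {"Amanda", "Anna", "Emma", "Erika", "Elin", "Hanna",
   "Karin", "Marie", "Sara", "Viola"};
maleNames = {"Anders", "Bo", "Erik", "Göran", "Jakob", "Martin", "Knut",
   "Sune", "Tim", "Örjan"};

(* Same training-set construction as above, using the variant extractor *)
trainingData = Join[
   Thread[(suffixFeatures /@ femaleNames) -> "Female"],
   Thread[(suffixFeatures /@ maleNames) -> "Male"]];

c = Classify[trainingData];

(* Class probabilities for an unseen name, as asked in the question *)
c[suffixFeatures["Daniel"], "Probabilities"]

(* Accuracy on a tiny hand-labelled held-out set (illustrative only) *)
testData = Thread[
   (suffixFeatures /@ {"Karina", "Sofie", "Sven"}) -> {"Female", "Female", "Male"}];
ClassifierMeasurements[c, testData, "Accuracy"]
```

With only twenty training names, the probabilities and the accuracy figure should be read as a sanity check rather than a real evaluation; a larger labelled list would be needed for that.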
Posted 8 years ago
 Thanks! That was a good explanation.
Posted 8 years ago
 That's a very good solution. I imagine this sort of phoneme classification could be applied to other areas as well.