Group Abstract

Message Boards

WOLFRAM COMMUNITY

5.9K Views

8 Replies

3 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Data Science Wolfram Language Machine Learning

Force Classify to treat data as numeric discrete count instead of Boolean?

Jamie Dixson

Jamie Dixson, UNT

Posted 7 years ago

I am attempting to use "Classify" to classify based on two classes. I can use data that are clearly numeric like 1.234 with no problem. However, one of my datasets is discrete count data and that is causing a problem. Most of the data points are either 0 or 1 with a few higher numbers scattered throughout (See example below). Mathematica automatically selects "mixed input from the training set but when I enter the test data, Mathematica assumes it is boolean. When it gets to the first number >1, it throws an error (See below). I need to force it to recognize all of the data points in the training data as numeric and all of the points in the test data as numeric. The test and training data files are attached below. How can this be done? Example Data: TrainingData={{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}-> "A",{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}->"B"} 'My datasets are actually much larger than this with 8000 features but this is just an example nn=Classify[TrainingData] 'This works fine but sets input to "Mixed" TestData={{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}, {0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}} nn[TestData] 'This throws the error ClassifierFunction::mlincfttp: Incompatible variable type (Boolean) and variable value (2). Attachments: TestData.txt TrainingData.txt

POSTED BY: Jamie Dixson

8 Replies

Sort By:

Rohit Namjoshi

Posted 7 years ago

I tried the fix that Jamie suggested and it does eliminate the error. However I don't think it explains why training on a subset of the training data also works even though the subset contains even fewer unique values. Convert to numeric classifier = Classify[N[training]] classifier[testing] All testing samples are classified as P, is that expected? "P", "P", "P", "P", "P", "P", "P"} The probabilities for P and NP are extremely close for most of the samples. classifier[testing, "Probabilities"] // Column { {<\|"NP" -> 0.496296, "P" -> 0.503704\|>}, {<\|"NP" -> 0.486345, "P" -> 0.513655\|>}, {<\|"NP" -> 0.499325, "P" -> 0.500675\|>}, {<\|"NP" -> 0.494909, "P" -> 0.505091\|>}, {<\|"NP" -> 0.492273, "P" -> 0.507727\|>}, {<\|"NP" -> 0.480284, "P" -> 0.519716\|>}, {<\|"NP" -> 0.458435, "P" -> 0.541565\|>} } I tried several of the available classification methods to see how much the results varied methods = {"DecisionTree", "GradientBoostedTrees", "LogisticRegression", "NaiveBayes", "NearestNeighbors", "NeuralNetwork", "RandomForest", "SupportVectorMachine"}; classifiers = Map[{#, Classify[N[training], Method -> #]} &, methods]; Map[First[#] -> Last[#][testing] &, classifiers] // Column Gives { {"DecisionTree" -> {"NP", "NP", "NP", "P", "NP", "NP", "P"}}, {"GradientBoostedTrees" -> {"P", "P", "NP", "NP", "NP", "P", "NP"}}, {"LogisticRegression" -> {"P", "P", "P", "P", "P", "P", "P"}}, {"NaiveBayes" -> {"NP", "P", "NP", "NP", "NP", "P", "P"}}, {"NearestNeighbors" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}}, {"NeuralNetwork" -> {"P", "P", "P", "NP", "P", "P", "P"}}, {"RandomForest" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}}, {"SupportVectorMachine" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}} } Without knowing the details of what this data represents it is hard to know which method is most appropriate.

I tried the fix that Jamie suggested and it does eliminate the error. However I don't think it explains why training on a subset of the training data also works even though the subset contains even fewer unique values.

Convert to numeric

classifier = Classify[N[training]]
classifier[testing]

All testing samples are classified as P, is that expected?

"P", "P", "P", "P", "P", "P", "P"}

The probabilities for P and NP are extremely close for most of the samples.

classifier[testing, "Probabilities"] // Column

{
 {<|"NP" -> 0.496296, "P" -> 0.503704|>},
 {<|"NP" -> 0.486345, "P" -> 0.513655|>},
 {<|"NP" -> 0.499325, "P" -> 0.500675|>},
 {<|"NP" -> 0.494909, "P" -> 0.505091|>},
 {<|"NP" -> 0.492273, "P" -> 0.507727|>},
 {<|"NP" -> 0.480284, "P" -> 0.519716|>},
 {<|"NP" -> 0.458435, "P" -> 0.541565|>}
}

I tried several of the available classification methods to see how much the results varied

methods = {"DecisionTree", "GradientBoostedTrees", 
   "LogisticRegression", "NaiveBayes", "NearestNeighbors", 
   "NeuralNetwork", "RandomForest", "SupportVectorMachine"};

classifiers = Map[{#, Classify[N[training], Method -> #]} &, methods];

Map[First[#] -> Last[#][testing] &, classifiers] // Column

Gives

{
 {"DecisionTree" -> {"NP", "NP", "NP", "P", "NP", "NP", "P"}},
 {"GradientBoostedTrees" -> {"P", "P", "NP", "NP", "NP", "P", "NP"}},
 {"LogisticRegression" -> {"P", "P", "P", "P", "P", "P", "P"}},
 {"NaiveBayes" -> {"NP", "P", "NP", "NP", "NP", "P", "P"}},
 {"NearestNeighbors" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}},
 {"NeuralNetwork" -> {"P", "P", "P", "NP", "P", "P", "P"}},
 {"RandomForest" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}},
 {"SupportVectorMachine" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}}
}

Without knowing the details of what this data represents it is hard to know which method is most appropriate.

POSTED BY: Rohit Namjoshi

Jim Baldwin

Jim Baldwin, Retired

Posted 7 years ago

POSTED BY: Jim Baldwin

Jamie Dixson

Jamie Dixson, UNT

Posted 7 years ago

That definitely sounds like the same problem. I will work with it some tomorrow. Thanks for pointing this out Jim.

POSTED BY: Jamie Dixson

Rohit Namjoshi

Posted 7 years ago

POSTED BY: Rohit Namjoshi

Jamie Dixson

Jamie Dixson, UNT

Posted 7 years ago

Thank you for working on this Rohit. Jim posted a link to a potential fix below. It sounds like more of a work around but it might work. I am going to work with it some tomorrow.

POSTED BY: Jamie Dixson

Jamie Dixson

Jamie Dixson, UNT

Posted 7 years ago

Any additional help with this would be greatly appreciated.

POSTED BY: Jamie Dixson

Rohit Namjoshi

Posted 7 years ago

I cannot reproduce this on 11.3.0 for Mac OS X. nn[TestData] {"B", "B"} Note that in the code you posted, both `TrainingData` and `TestData` are missing an opening `{`.

POSTED BY: Rohit Namjoshi

Jamie Dixson

Jamie Dixson, UNT

Posted 7 years ago

That was not my actual data, just examples. I just missed the opening bracket. The actual data files that I was using when the problem occurred are attached below. Attachments: TestData.txt TrainingData.txt

POSTED BY: Jamie Dixson

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback