Force Classify to treat data as numeric discrete count instead of Boolean?

Posted 7 years ago

I am attempting to use "Classify" to classify data into two classes. Clearly numeric data such as 1.234 work without problems. However, one of my datasets is discrete count data, and that is causing a problem. Most of the data points are either 0 or 1, with a few higher numbers scattered throughout (see the example below). Mathematica automatically selects "Mixed" input from the training set, but when I enter the test data, Mathematica assumes it is Boolean. When it reaches the first number greater than 1, it throws an error (see below). I need to force it to treat all of the data points in both the training data and the test data as numeric. The test and training data files are attached below. How can this be done?

Example Data:

TrainingData={{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}->
"A",{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}->"B"}   

(* My datasets are actually much larger than this, with 8000 features; this is just an example *)

nn=Classify[TrainingData]                

(* This works fine but sets the input type to "Mixed" *)

TestData={{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0},
{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}}  

nn[TestData]         

(* This throws the error *)

ClassifierFunction::mlincfttp: Incompatible variable type (Boolean) and variable value (2).
POSTED BY: Jamie Dixson
8 Replies
Posted 7 years ago

I tried the fix that Jamie suggested, and it does eliminate the error. However, I don't think it explains why training on a subset of the training data also works, even though the subset contains even fewer unique values.

Convert to numeric

classifier = Classify[N[training]]
classifier[testing]
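
For readers without the attachments, here is a minimal sketch of the same conversion on made-up count data. The names `trainingCounts`, `testCounts`, and `toyClassifier` and all of the values are hypothetical; only the use of `N` before `Classify` comes from the lines above. The idea is that `N` turns the exact integers into machine reals, so the features no longer look like 0/1 flags, while the string labels are left untouched.

```wolfram
(* Toy count data, invented for illustration; the real sets have about 8000 features *)
trainingCounts = {
   {0, 1, 0, 0, 2, 0, 1, 0} -> "A",
   {0, 0, 0, 1, 0, 0, 0, 0} -> "A",
   {1, 0, 3, 0, 0, 1, 0, 2} -> "B",
   {0, 2, 1, 0, 0, 3, 0, 1} -> "B"};
testCounts = {{0, 1, 0, 0, 3, 0, 1, 0}, {1, 0, 2, 0, 0, 1, 0, 2}};

(* N converts the integer counts to machine reals; the class labels are strings and stay as they are *)
numericTraining = N[trainingCounts];
numericTest = N[testCounts];

(* Train on the real-valued features and classify the real-valued test vectors *)
toyClassifier = Classify[numericTraining];
toyPredictions = toyClassifier[numericTest]
```

Applying `N` to the test vectors as well is a precaution rather than something the thread shows to be required. With only four training examples the predictions themselves mean nothing; the sketch only illustrates the type conversion.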

With the attached data, all testing samples are classified as P. Is that expected?

{"P", "P", "P", "P", "P", "P", "P"}

The probabilities for P and NP are extremely close for most of the samples.

classifier[testing, "Probabilities"] // Column

{
 {<|"NP" -> 0.496296, "P" -> 0.503704|>},
 {<|"NP" -> 0.486345, "P" -> 0.513655|>},
 {<|"NP" -> 0.499325, "P" -> 0.500675|>},
 {<|"NP" -> 0.494909, "P" -> 0.505091|>},
 {<|"NP" -> 0.492273, "P" -> 0.507727|>},
 {<|"NP" -> 0.480284, "P" -> 0.519716|>},
 {<|"NP" -> 0.458435, "P" -> 0.541565|>}
} 
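
When the top probability sits this close to 0.5, one option is to let the classifier abstain instead of forcing a label. The sketch below uses the `IndeterminateThreshold` option on invented toy data; the names `toyTraining`, `toyTesting`, and `toyClassifier` and the 0.55 cutoff are assumptions chosen for illustration, not values taken from the attached files.

```wolfram
(* Hypothetical toy data; the labels mirror the "P"/"NP" classes in this thread *)
toyTraining = N[{
    {0, 1, 0, 2} -> "P", {1, 0, 0, 3} -> "P", {0, 1, 1, 2} -> "P",
    {0, 0, 1, 0} -> "NP", {2, 0, 1, 0} -> "NP", {1, 0, 2, 0} -> "NP"}];
toyTesting = N[{{0, 1, 0, 1}, {1, 0, 1, 0}}];
toyClassifier = Classify[toyTraining];

(* Largest class probability for each test sample *)
topProbabilities = Max[Values[#]] & /@ toyClassifier[toyTesting, "Probabilities"];

(* Return Indeterminate for any sample whose top probability is below the cutoff *)
decisions = toyClassifier[toyTesting, IndeterminateThreshold -> 0.55]
```

A sample that comes back as `Indeterminate` is one the model cannot separate at the chosen cutoff, which is more informative than a label decided by a margin of a fraction of a percent.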

I tried several of the available classification methods to see how much the results varied:

methods = {"DecisionTree", "GradientBoostedTrees", 
   "LogisticRegression", "NaiveBayes", "NearestNeighbors", 
   "NeuralNetwork", "RandomForest", "SupportVectorMachine"};

classifiers = Map[{#, Classify[N[training], Method -> #]} &, methods];

Map[First[#] -> Last[#][testing] &, classifiers] // Column

Gives

{
 {"DecisionTree" -> {"NP", "NP", "NP", "P", "NP", "NP", "P"}},
 {"GradientBoostedTrees" -> {"P", "P", "NP", "NP", "NP", "P", "NP"}},
 {"LogisticRegression" -> {"P", "P", "P", "P", "P", "P", "P"}},
 {"NaiveBayes" -> {"NP", "P", "NP", "NP", "NP", "P", "P"}},
 {"NearestNeighbors" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}},
 {"NeuralNetwork" -> {"P", "P", "P", "NP", "P", "P", "P"}},
 {"RandomForest" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}},
 {"SupportVectorMachine" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}}
}

Without knowing the details of what this data represents, it is hard to know which method is most appropriate.
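
If some labeled samples can be held out, the methods can at least be compared on accuracy. This is a rough sketch using `ClassifierMeasurements` on invented data; `toyTraining`, `toyValidation`, the shortened method list, and all of the values are hypothetical, and a real comparison would need far more held-out samples than this.

```wolfram
(* Hypothetical labeled data, split into a training part and a held-out validation part *)
toyTraining = N[{
    {0, 1, 0, 2} -> "P", {1, 0, 0, 3} -> "P", {0, 1, 1, 2} -> "P",
    {0, 0, 1, 0} -> "NP", {2, 0, 1, 0} -> "NP", {1, 0, 2, 0} -> "NP"}];
toyValidation = N[{{0, 1, 0, 1} -> "P", {1, 0, 1, 0} -> "NP"}];

(* A subset of the methods tried above, to keep the sketch quick *)
toyMethods = {"LogisticRegression", "NearestNeighbors", "RandomForest"};

(* Train one classifier per method and record its accuracy on the held-out samples *)
accuracies = AssociationMap[
   ClassifierMeasurements[
     Classify[toyTraining, Method -> #], toyValidation, "Accuracy"] &,
   toyMethods]
```

The result is an association from method name to accuracy. Other properties, such as "ConfusionMatrixPlot", can be requested from `ClassifierMeasurements` in the same way.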

POSTED BY: Rohit Namjoshi
Posted 7 years ago
POSTED BY: Jim Baldwin
Posted 7 years ago

That definitely sounds like the same problem. I will work with it some tomorrow. Thanks for pointing this out, Jim.

POSTED BY: Jamie Dixson
Posted 7 years ago
POSTED BY: Rohit Namjoshi
Posted 7 years ago

Thank you for working on this, Rohit. Jim posted a link to a potential fix below. It sounds like more of a workaround, but it might work. I am going to work with it some tomorrow.

POSTED BY: Jamie Dixson
Posted 7 years ago

Any additional help with this would be greatly appreciated.

POSTED BY: Jamie Dixson
Posted 7 years ago

I cannot reproduce this on 11.3.0 for Mac OS X.

nn[TestData]
{"B", "B"}

Note that in the code you posted, both TrainingData and TestData are missing an opening {.

POSTED BY: Rohit Namjoshi
Posted 7 years ago

That was not my actual data, just examples. I just missed the opening bracket. The actual data files that I was using when the problem occurred are attached below.

POSTED BY: Jamie Dixson