Message Boards

Force Classify to treat data as numeric discrete count instead of Boolean?

Posted 1 month ago | 343 Views | 8 Replies | 3 Total Likes
I am attempting to use Classify to classify data into two classes. Data that are clearly numeric, like 1.234, work with no problem. However, one of my datasets consists of discrete count data, and that is causing a problem: most of the data points are 0 or 1, with a few higher numbers scattered throughout (see the example below). Mathematica automatically selects "mixed" input from the training set, but when I enter the test data it assumes the features are Boolean, and when it reaches the first number greater than 1 it throws an error (see below). I need to force it to treat all of the data points in both the training and the test data as numeric. The training and test data files are attached below. How can this be done?

Example Data:

TrainingData={{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}->
"A",{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}->"B"}   

My datasets are actually much larger than this, with 8000 features; this is just an example.

nn=Classify[TrainingData]                

This works fine, but the input type is set to "Mixed".

TestData={{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0},
{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}}  

nn[TestData]         

This throws the error:

ClassifierFunction::mlincfttp: Incompatible variable type (Boolean) and variable value (2).
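A quick way to confirm which type Classify inferred for each feature is to inspect the trained classifier's summary. This is a sketch assuming the 11.x function ClassifierInformation (in later versions, Information on the ClassifierFunction serves the same purpose):

```mathematica
(* Show the classifier's summary panel, which includes the
   inferred input type ("Mixed", "Boolean", "Numerical", ...).
   In Mathematica 12.0+ use Information[nn] instead. *)
ClassifierInformation[nn]
```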
Posted 1 month ago

I cannot reproduce this on 11.3.0 for Mac OS X.

nn[TestData]
{"B", "B"}

Note that in the code you posted, both TrainingData and TestData are missing an opening {.

Posted 1 month ago

That was not my actual data, just an example; I missed the opening bracket when typing it. The actual data files I was using when the problem occurred are attached below.

Posted 1 month ago

Any additional help with this would be greatly appreciated.

Posted 1 month ago

Hi Jamie,

I tried with the actual data you provided and I can reproduce the problem. So something about the data, other than the presence of values different from 0 and 1, is causing it. I examined the training and testing data looking for anomalies and did not find anything odd.

training = Import["~/Downloads/TrainingData.txt"] // ToExpression;

Distribution of labels

training // Values // Tally
{{"P", 7}, {"NP", 14}}

Lengths of training data are all the same

training // Keys // Map[Length] // Union
{8000}

Count of training data values in each sample

training // Keys // Map[Counts] // Map[KeySort] // Column
{
 {<|0 -> 7791, 1 -> 191, 2 -> 9, 3 -> 1, 4 -> 4, 5 -> 2, 6 -> 1, 
   7 -> 1|>},
 {<|0 -> 7403, 1 -> 539, 2 -> 41, 3 -> 11, 4 -> 4, 6 -> 1, 7 -> 1|>},
 {<|0 -> 7732, 1 -> 251, 2 -> 15, 3 -> 2|>},
 {<|0 -> 7051, 1 -> 753, 2 -> 138, 3 -> 37, 4 -> 8, 5 -> 3, 6 -> 3, 
   7 -> 1, 9 -> 2, 10 -> 1, 12 -> 1, 23 -> 1, 37 -> 1|>},
 {<|0 -> 7606, 1 -> 373, 2 -> 14, 3 -> 2, 4 -> 3, 5 -> 1, 6 -> 1|>},
 {<|0 -> 7275, 1 -> 615, 2 -> 90, 3 -> 10, 4 -> 5, 5 -> 1, 6 -> 1, 
   31 -> 2, 46 -> 1|>},
 {<|0 -> 7631, 1 -> 318, 2 -> 35, 3 -> 5, 4 -> 4, 5 -> 1, 7 -> 1, 
   8 -> 3, 11 -> 1, 22 -> 1|>},
 {<|0 -> 7651, 1 -> 337, 2 -> 11, 3 -> 1|>},
 {<|0 -> 7633, 1 -> 349, 2 -> 18|>},
 {<|0 -> 7649, 1 -> 334, 2 -> 16, 3 -> 1|>},
 {<|0 -> 7623, 1 -> 353, 2 -> 24|>},
 {<|0 -> 6959, 1 -> 921, 2 -> 108, 3 -> 10, 4 -> 1, 5 -> 1|>},
 {<|0 -> 7283, 1 -> 657, 2 -> 51, 3 -> 7, 4 -> 2|>},
 {<|0 -> 7831, 1 -> 163, 2 -> 6|>},
 {<|0 -> 7371, 1 -> 576, 2 -> 48, 3 -> 2, 4 -> 3|>},
 {<|0 -> 7423, 1 -> 518, 2 -> 52, 3 -> 5, 4 -> 2|>},
 {<|0 -> 7627, 1 -> 352, 2 -> 17, 3 -> 4|>},
 {<|0 -> 7541, 1 -> 434, 2 -> 24, 3 -> 1|>},
 {<|0 -> 6993, 1 -> 797, 2 -> 141, 3 -> 50, 4 -> 15, 5 -> 3, 6 -> 1|>},
 {<|0 -> 6976, 1 -> 902, 2 -> 103, 3 -> 17, 4 -> 2|>},
 {<|0 -> 7548, 1 -> 435, 2 -> 17|>}
}

Unique training data values

training // Keys // Flatten // Union
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 22, 23, 31, 37, 46}

Testing data

testing = Import["~/Downloads/TestData.txt"] // ToExpression;

Lengths of testing data are all the same and match training

testing // Map[Length] // Union
{8000}

Count of testing data values in each sample

testing // Map[Counts] // Map[KeySort] // Column
{
 {<|0 -> 7564, 1 -> 414, 2 -> 22|>},
 {<|0 -> 7905, 1 -> 88, 2 -> 5, 3 -> 2|>},
 {<|0 -> 7426, 1 -> 549, 2 -> 24, 4 -> 1|>},
 {<|0 -> 7682, 1 -> 306, 2 -> 11, 3 -> 1|>},
 {<|0 -> 7787, 1 -> 204, 2 -> 9|>},
 {<|0 -> 7666, 1 -> 323, 2 -> 10, 9 -> 1|>},
 {<|0 -> 7741, 1 -> 190, 2 -> 39, 3 -> 15, 4 -> 7, 5 -> 3, 6 -> 1, 
   7 -> 1, 9 -> 1, 11 -> 1, 14 -> 1|>}
}

Unique testing data values

testing // Flatten // Union
{0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14}

There are values in the testing data that are not in the training data, and vice versa, but I don't see why that would cause a problem.

In training, but not in testing

Complement[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 22, 23, 31, 37, 
  46}, {0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14}]

{8, 10, 12, 22, 23, 31, 37, 46}

In testing, but not in training

Complement[{0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14}, {0, 1, 2, 3, 4, 5, 6, 
  7, 8, 9, 10, 11, 12, 22, 23, 31, 37, 46}]

{14}

I tried training on a subset of the data (the first two samples of P and the first two of NP), and the classifier worked on all of the test data. I have no idea why.

classifier = Classify[Join[training[[1 ;; 2]], training[[8 ;; 9]]]]
classifier[testing]
{"NP", "NP", "NP", "NP", "NP", "P", "P"}

If I have time I will explore further tomorrow.

Rohit

Posted 1 month ago

Thank you for working on this, Rohit. Jim posted a link to a potential fix below. It sounds more like a workaround, but it might work. I am going to work with it some tomorrow.

The answer might be here: Mathematica StackExchange.

Essentially you would use

nn=Classify[N[TrainingData]]      
nn[N[TestData]]      
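
If casting everything with N is undesirable, another approach is to declare the feature types explicitly. This is a sketch using Classify's documented FeatureTypes option, which should prevent the Boolean inference up front (8000 is the feature count mentioned earlier in the thread):

```mathematica
(* Declare every feature as numerical so that columns containing
   only 0s and 1s are not inferred to be Boolean. *)
nn = Classify[TrainingData,
  FeatureTypes -> ConstantArray["Numerical", 8000]];
nn[TestData]
```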
Posted 1 month ago

That definitely sounds like the same problem. I will work with it some tomorrow. Thanks for pointing this out, Jim.

Posted 1 month ago

I tried the fix that Jamie suggested, and it does eliminate the error. However, I don't think it explains why training on a subset of the training data also works, even though the subset contains even fewer unique values.

Convert to numeric

classifier = Classify[N[training]]
classifier[testing]

All testing samples are classified as P; is that expected?

{"P", "P", "P", "P", "P", "P", "P"}

The probabilities for P and NP are extremely close for most of the samples.

classifier[testing, "Probabilities"] // Column

{
 {<|"NP" -> 0.496296, "P" -> 0.503704|>},
 {<|"NP" -> 0.486345, "P" -> 0.513655|>},
 {<|"NP" -> 0.499325, "P" -> 0.500675|>},
 {<|"NP" -> 0.494909, "P" -> 0.505091|>},
 {<|"NP" -> 0.492273, "P" -> 0.507727|>},
 {<|"NP" -> 0.480284, "P" -> 0.519716|>},
 {<|"NP" -> 0.458435, "P" -> 0.541565|>}
} 
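
With the probabilities this close to 0.5, it may be worth refusing to commit to a class on borderline samples. A sketch using the documented IndeterminateThreshold option when applying the ClassifierFunction:

```mathematica
(* Return Indeterminate unless the winning class probability
   exceeds 0.55; given probabilities near 0.5, most of these
   samples would likely come back Indeterminate. *)
classifier[testing, IndeterminateThreshold -> 0.55]
```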

I tried several of the available classification methods to see how much the results varied.

methods = {"DecisionTree", "GradientBoostedTrees", 
   "LogisticRegression", "NaiveBayes", "NearestNeighbors", 
   "NeuralNetwork", "RandomForest", "SupportVectorMachine"};

classifiers = Map[{#, Classify[N[training], Method -> #]} &, methods];

Map[First[#] -> Last[#][testing] &, classifiers] // Column

Gives

{
 {"DecisionTree" -> {"NP", "NP", "NP", "P", "NP", "NP", "P"}},
 {"GradientBoostedTrees" -> {"P", "P", "NP", "NP", "NP", "P", "NP"}},
 {"LogisticRegression" -> {"P", "P", "P", "P", "P", "P", "P"}},
 {"NaiveBayes" -> {"NP", "P", "NP", "NP", "NP", "P", "P"}},
 {"NearestNeighbors" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}},
 {"NeuralNetwork" -> {"P", "P", "P", "NP", "P", "P", "P"}},
 {"RandomForest" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}},
 {"SupportVectorMachine" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}}
}

Without knowing the details of what this data represents, it is hard to know which method is most appropriate.
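
If labeled validation data were available, the methods could be compared quantitatively rather than by eyeballing their disagreements. A sketch, where validationSet is a hypothetical list of input -> label rules held out from training:

```mathematica
(* For each trained classifier, compute its accuracy on a
   hypothetical labeled validation set using the built-in
   ClassifierMeasurements function. *)
Map[First[#] ->
    ClassifierMeasurements[Last[#], validationSet, "Accuracy"] &,
  classifiers] // Column
```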
