# Force Classify to treat data as numeric discrete count instead of Boolean?

Posted 1 month ago | 343 Views | 8 Replies | 3 Total Likes
I am attempting to use `Classify` to classify based on two classes. I can use data that are clearly numeric, like 1.234, with no problem. However, one of my datasets is discrete count data, and that is causing a problem. Most of the data points are either 0 or 1, with a few higher numbers scattered throughout (see example below). Mathematica automatically selects "Mixed" input from the training set, but when I enter the test data, Mathematica assumes it is Boolean. When it gets to the first number >1, it throws an error (see below). I need to force it to recognize all of the data points in both the training data and the test data as numeric. The test and training data files are attached below. How can this be done?

Example data (my actual datasets are much larger than this, with 8000 features; this is just an example):

```
TrainingData = {{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0} -> "A",
   {0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0} -> "B"}

nn = Classify[TrainingData]  (* works fine, but sets the input type to "Mixed" *)

TestData = {{0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0},
   {0,1,0,0,0,2,0,1,0,0,0,0,0,0,3,0,0,0,1,0}}

nn[TestData]  (* throws the error below *)
```

The error:

```
ClassifierFunction::mlincfttp: Incompatible variable type (Boolean) and variable value (2).
```

Attachments:
Posted 1 month ago
I cannot reproduce this on 11.3.0 for Mac OS X:

```
nn[TestData]
{"B", "B"}
```

Note that in the code you posted, both `TrainingData` and `TestData` are missing an opening `{`.
Posted 1 month ago
 That was not my actual data, just examples. I just missed the opening bracket. The actual data files that I was using when the problem occurred are attached below. Attachments:
Posted 1 month ago
 Any additional help with this would be greatly appreciated.
Posted 1 month ago
Hi Jamie,

I tried with the actual data you provided, and I can reproduce the problem. So there is something about the data, other than the presence of values different from 0 and 1, that is causing the problem. I examined the training and testing data looking for anomalies and did not find anything odd.

```
training = Import["~/Downloads/TrainingData.txt"] // ToExpression;
```

Distribution of labels:

```
training // Values // Tally
{{"P", 7}, {"NP", 14}}
```

Lengths of the training samples are all the same:

```
training // Keys // Map[Length] // Union
{8000}
```

Counts of the training data values in each sample:

```
training // Keys // Map[Counts] // Map[KeySort] // Column
{
 {<|0 -> 7791, 1 -> 191, 2 -> 9, 3 -> 1, 4 -> 4, 5 -> 2, 6 -> 1, 7 -> 1|>},
 {<|0 -> 7403, 1 -> 539, 2 -> 41, 3 -> 11, 4 -> 4, 6 -> 1, 7 -> 1|>},
 {<|0 -> 7732, 1 -> 251, 2 -> 15, 3 -> 2|>},
 {<|0 -> 7051, 1 -> 753, 2 -> 138, 3 -> 37, 4 -> 8, 5 -> 3, 6 -> 3, 7 -> 1, 9 -> 2, 10 -> 1, 12 -> 1, 23 -> 1, 37 -> 1|>},
 {<|0 -> 7606, 1 -> 373, 2 -> 14, 3 -> 2, 4 -> 3, 5 -> 1, 6 -> 1|>},
 {<|0 -> 7275, 1 -> 615, 2 -> 90, 3 -> 10, 4 -> 5, 5 -> 1, 6 -> 1, 31 -> 2, 46 -> 1|>},
 {<|0 -> 7631, 1 -> 318, 2 -> 35, 3 -> 5, 4 -> 4, 5 -> 1, 7 -> 1, 8 -> 3, 11 -> 1, 22 -> 1|>},
 {<|0 -> 7651, 1 -> 337, 2 -> 11, 3 -> 1|>},
 {<|0 -> 7633, 1 -> 349, 2 -> 18|>},
 {<|0 -> 7649, 1 -> 334, 2 -> 16, 3 -> 1|>},
 {<|0 -> 7623, 1 -> 353, 2 -> 24|>},
 {<|0 -> 6959, 1 -> 921, 2 -> 108, 3 -> 10, 4 -> 1, 5 -> 1|>},
 {<|0 -> 7283, 1 -> 657, 2 -> 51, 3 -> 7, 4 -> 2|>},
 {<|0 -> 7831, 1 -> 163, 2 -> 6|>},
 {<|0 -> 7371, 1 -> 576, 2 -> 48, 3 -> 2, 4 -> 3|>},
 {<|0 -> 7423, 1 -> 518, 2 -> 52, 3 -> 5, 4 -> 2|>},
 {<|0 -> 7627, 1 -> 352, 2 -> 17, 3 -> 4|>},
 {<|0 -> 7541, 1 -> 434, 2 -> 24, 3 -> 1|>},
 {<|0 -> 6993, 1 -> 797, 2 -> 141, 3 -> 50, 4 -> 15, 5 -> 3, 6 -> 1|>},
 {<|0 -> 6976, 1 -> 902, 2 -> 103, 3 -> 17, 4 -> 2|>},
 {<|0 -> 7548, 1 -> 435, 2 -> 17|>}
}
```

Unique training data values:

```
training // Keys // Flatten // Union
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 22, 23, 31, 37, 46}
```

Testing data:
```
testing = Import["~/Downloads/TestData.txt"] // ToExpression;
```

Lengths of the testing samples are all the same and match the training data:

```
testing // Map[Length] // Union
{8000}
```

Counts of the testing data values in each sample:

```
testing // Map[Counts] // Map[KeySort] // Column
{
 {<|0 -> 7564, 1 -> 414, 2 -> 22|>},
 {<|0 -> 7905, 1 -> 88, 2 -> 5, 3 -> 2|>},
 {<|0 -> 7426, 1 -> 549, 2 -> 24, 4 -> 1|>},
 {<|0 -> 7682, 1 -> 306, 2 -> 11, 3 -> 1|>},
 {<|0 -> 7787, 1 -> 204, 2 -> 9|>},
 {<|0 -> 7666, 1 -> 323, 2 -> 10, 9 -> 1|>},
 {<|0 -> 7741, 1 -> 190, 2 -> 39, 3 -> 15, 4 -> 7, 5 -> 3, 6 -> 1, 7 -> 1, 9 -> 1, 11 -> 1, 14 -> 1|>}
}
```

Unique testing data values:

```
testing // Flatten // Union
{0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14}
```

There are values in the testing data that are not in the training data, and vice versa, but I don't see why that would cause a problem.

In training, but not in testing:

```
Complement[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 22, 23, 31, 37, 46},
 {0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14}]
{8, 10, 12, 22, 23, 31, 37, 46}
```

In testing, but not in training:

```
Complement[{0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14},
 {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 22, 23, 31, 37, 46}]
{14}
```

I tried training on a subset of the data (the first two samples of P and the first two of NP) and the classifier worked on all of the test data. I have no idea why.

```
classifier = Classify[Join[training[[1 ;; 2]], training[[8 ;; 9]]]]
classifier[testing]
{"NP", "NP", "NP", "NP", "NP", "P", "P"}
```

If I have time I will explore further tomorrow.

Rohit
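A quick way to see which type `Classify` actually inferred for the features is to query the trained classifier's information properties. This is a sketch; the `"FeatureTypes"` property name is an assumption here (older versions expose classifier properties through `ClassifierInformation` instead of `Information`), so check what properties your version reports:

```
nn = Classify[training];
Information[nn, "FeatureTypes"]  (* or ClassifierInformation[nn, "FeatureTypes"] on older versions *)
```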
Posted 1 month ago
Thank you for working on this, Rohit. Jim posted a link to a potential fix below. It sounds more like a workaround, but it might work. I am going to work with it some tomorrow.
Posted 1 month ago
The answer might be here: Mathematica StackExchange. Essentially, you would use

```
nn = Classify[N[TrainingData]]
nn[N[TestData]]
```
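Applied to the attached files, the workaround would look like the sketch below. The file paths match the ones Rohit used; the `FeatureTypes` alternative at the end is an assumption, available only on versions whose `Classify` supports that option:

```
training = Import["~/Downloads/TrainingData.txt"] // ToExpression;
testing = Import["~/Downloads/TestData.txt"] // ToExpression;

(* N converts the integer counts to reals, so Classify infers
   numerical features instead of Boolean/nominal ones *)
nn = Classify[N[training]];
nn[N[testing]]

(* possible alternative (assumption: the FeatureTypes option is
   available in your version), declaring the type explicitly
   instead of converting the data: *)
nn2 = Classify[training, FeatureTypes -> "Numerical"];
```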
Posted 1 month ago
 That definitely sounds like the same problem. I will work with it some tomorrow. Thanks for pointing this out Jim.
Posted 1 month ago
I tried the fix that Jim suggested, and it does eliminate the error. However, I don't think it explains why training on a subset of the training data also works, even though the subset contains even fewer unique values.

Convert to numeric:

```
classifier = Classify[N[training]]
classifier[testing]
{"P", "P", "P", "P", "P", "P", "P"}
```

All testing samples are classified as P; is that expected? The probabilities for P and NP are extremely close for most of the samples:

```
classifier[testing, "Probabilities"] // Column
{
 {<|"NP" -> 0.496296, "P" -> 0.503704|>},
 {<|"NP" -> 0.486345, "P" -> 0.513655|>},
 {<|"NP" -> 0.499325, "P" -> 0.500675|>},
 {<|"NP" -> 0.494909, "P" -> 0.505091|>},
 {<|"NP" -> 0.492273, "P" -> 0.507727|>},
 {<|"NP" -> 0.480284, "P" -> 0.519716|>},
 {<|"NP" -> 0.458435, "P" -> 0.541565|>}
}
```

I tried several of the available classification methods to see how much the results varied:

```
methods = {"DecisionTree", "GradientBoostedTrees", "LogisticRegression",
   "NaiveBayes", "NearestNeighbors", "NeuralNetwork", "RandomForest",
   "SupportVectorMachine"};
classifiers = Map[{#, Classify[N[training], Method -> #]} &, methods];
Map[First[#] -> Last[#][testing] &, classifiers] // Column
```

This gives:

```
{
 {"DecisionTree" -> {"NP", "NP", "NP", "P", "NP", "NP", "P"}},
 {"GradientBoostedTrees" -> {"P", "P", "NP", "NP", "NP", "P", "NP"}},
 {"LogisticRegression" -> {"P", "P", "P", "P", "P", "P", "P"}},
 {"NaiveBayes" -> {"NP", "P", "NP", "NP", "NP", "P", "P"}},
 {"NearestNeighbors" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}},
 {"NeuralNetwork" -> {"P", "P", "P", "NP", "P", "P", "P"}},
 {"RandomForest" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}},
 {"SupportVectorMachine" -> {"NP", "NP", "NP", "NP", "NP", "NP", "NP"}}
}
```

Without knowing the details of what this data represents, it is hard to know which method is most appropriate.
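Since the attached test data is unlabeled, one way to compare these methods more quantitatively is to hold out part of the labeled training set and score each method with `ClassifierMeasurements`. This is only a sketch: the 15/6 split, the random seed, and the three methods chosen are arbitrary illustrative choices, not from the thread:

```
SeedRandom[1];
(* shuffle the 21 labeled samples, train on 15, hold out 6 for scoring *)
{trainPart, holdOut} = TakeDrop[RandomSample[training], 15];
Map[
 # -> ClassifierMeasurements[
    Classify[N[trainPart], Method -> #], N[holdOut], "Accuracy"] &,
 {"LogisticRegression", "RandomForest", "NearestNeighbors"}] // Column
```

With only 21 labeled samples the held-out accuracies will be very noisy, but they at least give a data-driven basis for preferring one method over another.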