Hi Jamie,
I ran it with the actual data you provided and can reproduce the problem. So something about the data, other than the presence of values besides 0 and 1, is causing it. I examined the training and testing data for anomalies and did not find anything odd.
training = Import["~/Downloads/TrainingData.txt"] // ToExpression;
Distribution of labels
training // Values // Tally
{{"P", 7}, {"NP", 14}}
Lengths of training data are all the same
training // Keys // Map[Length] // Union
{8000}
Count of training data values in each sample
training // Keys // Map[Counts] // Map[KeySort] // Column
{
{<|0 -> 7791, 1 -> 191, 2 -> 9, 3 -> 1, 4 -> 4, 5 -> 2, 6 -> 1,
7 -> 1|>},
{<|0 -> 7403, 1 -> 539, 2 -> 41, 3 -> 11, 4 -> 4, 6 -> 1, 7 -> 1|>},
{<|0 -> 7732, 1 -> 251, 2 -> 15, 3 -> 2|>},
{<|0 -> 7051, 1 -> 753, 2 -> 138, 3 -> 37, 4 -> 8, 5 -> 3, 6 -> 3,
7 -> 1, 9 -> 2, 10 -> 1, 12 -> 1, 23 -> 1, 37 -> 1|>},
{<|0 -> 7606, 1 -> 373, 2 -> 14, 3 -> 2, 4 -> 3, 5 -> 1, 6 -> 1|>},
{<|0 -> 7275, 1 -> 615, 2 -> 90, 3 -> 10, 4 -> 5, 5 -> 1, 6 -> 1,
31 -> 2, 46 -> 1|>},
{<|0 -> 7631, 1 -> 318, 2 -> 35, 3 -> 5, 4 -> 4, 5 -> 1, 7 -> 1,
8 -> 3, 11 -> 1, 22 -> 1|>},
{<|0 -> 7651, 1 -> 337, 2 -> 11, 3 -> 1|>},
{<|0 -> 7633, 1 -> 349, 2 -> 18|>},
{<|0 -> 7649, 1 -> 334, 2 -> 16, 3 -> 1|>},
{<|0 -> 7623, 1 -> 353, 2 -> 24|>},
{<|0 -> 6959, 1 -> 921, 2 -> 108, 3 -> 10, 4 -> 1, 5 -> 1|>},
{<|0 -> 7283, 1 -> 657, 2 -> 51, 3 -> 7, 4 -> 2|>},
{<|0 -> 7831, 1 -> 163, 2 -> 6|>},
{<|0 -> 7371, 1 -> 576, 2 -> 48, 3 -> 2, 4 -> 3|>},
{<|0 -> 7423, 1 -> 518, 2 -> 52, 3 -> 5, 4 -> 2|>},
{<|0 -> 7627, 1 -> 352, 2 -> 17, 3 -> 4|>},
{<|0 -> 7541, 1 -> 434, 2 -> 24, 3 -> 1|>},
{<|0 -> 6993, 1 -> 797, 2 -> 141, 3 -> 50, 4 -> 15, 5 -> 3, 6 -> 1|>},
{<|0 -> 6976, 1 -> 902, 2 -> 103, 3 -> 17, 4 -> 2|>},
{<|0 -> 7548, 1 -> 435, 2 -> 17|>}
}
Unique training data values
training // Keys // Flatten // Union
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 22, 23, 31, 37, 46}
Testing data
testing = Import["~/Downloads/TestData.txt"] // ToExpression;
Lengths of testing data are all the same and match training
testing // Map[Length] // Union
{8000}
Count of testing data values in each sample
testing // Map[Counts] // Map[KeySort] // Column
{
{<|0 -> 7564, 1 -> 414, 2 -> 22|>},
{<|0 -> 7905, 1 -> 88, 2 -> 5, 3 -> 2|>},
{<|0 -> 7426, 1 -> 549, 2 -> 24, 4 -> 1|>},
{<|0 -> 7682, 1 -> 306, 2 -> 11, 3 -> 1|>},
{<|0 -> 7787, 1 -> 204, 2 -> 9|>},
{<|0 -> 7666, 1 -> 323, 2 -> 10, 9 -> 1|>},
{<|0 -> 7741, 1 -> 190, 2 -> 39, 3 -> 15, 4 -> 7, 5 -> 3, 6 -> 1,
7 -> 1, 9 -> 1, 11 -> 1, 14 -> 1|>}
}
Unique testing data values
testing // Flatten // Union
{0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14}
There are values in the testing data that are not in the training data, and vice versa, but I don't see why that would cause a problem.
In training, but not in testing
Complement[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 22, 23, 31, 37,
46}, {0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14}]
{8, 10, 12, 22, 23, 31, 37, 46}
In testing, but not in training
Complement[{0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14}, {0, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 22, 23, 31, 37, 46}]
{14}
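The two comparisons above use value lists pasted in by hand. As a minimal sketch, the same comparison can be computed straight from the data. The names trainingToy and testingToy are hypothetical stand-ins with made-up values, used only so the snippet runs without the files; with the real data, training and testing would take their place.

```wolfram
(* Hypothetical stand-in data in the same shape as the real files:
   training is a list of sample -> label rules, testing is a list of samples. *)
trainingToy = {{0, 1, 2, 8} -> "P", {0, 0, 1, 3} -> "NP"};
testingToy = {{0, 1, 1, 2}, {0, 3, 14, 0}};

(* Distinct values on each side, computed the same way as above *)
trainingValues = trainingToy // Keys // Flatten // Union;
testingValues = testingToy // Flatten // Union;

Complement[trainingValues, testingValues]  (* in training, but not in testing: {8} *)
Complement[testingValues, trainingValues]  (* in testing, but not in training: {14} *)
```

Computing both lists from the data avoids copy errors if the files change.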
I tried training on a subset of the data (the first two "P" samples and the first two "NP" samples), and the classifier returned a label for every test sample. I have no idea why the subset works when the full set does not.
classifier = Classify[Join[training[[1 ;; 2]], training[[8 ;; 9]]]]
classifier[testing]
{"NP", "NP", "NP", "NP", "NP", "P", "P"}
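The positional spans 1 ;; 2 and 8 ;; 9 rely on the file listing the 7 "P" samples before the 14 "NP" samples. A sketch of picking the subset by label instead, which does not depend on the ordering, is below. The names toyTraining and firstTwo are hypothetical, and the toy values are made up so the snippet runs on its own.

```wolfram
(* Hypothetical stand-in for training: a list of sample -> label rules *)
toyTraining = {{0, 1} -> "P", {1, 0} -> "P", {1, 1} -> "P",
   {0, 0} -> "NP", {2, 0} -> "NP", {0, 2} -> "NP"};

(* First two rules carrying the given label; Last of a rule is its right-hand side *)
firstTwo[label_] := Take[Select[toyTraining, Last[#] === label &], UpTo[2]];

subset = Join[firstTwo["P"], firstTwo["NP"]]
(* {{0, 1} -> "P", {1, 0} -> "P", {0, 0} -> "NP", {2, 0} -> "NP"} *)
```

With the real data, the same selection applied to training could be passed to Classify in place of the Join of the two spans.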
If I have time, I will explore further tomorrow.
Rohit