# Automate data file imports to use Classify?

Q Q (2 Votes)

Hello, I am trying to use the default data files for machine learning. Often, these come as CSV or as plain-text data files. For example, this data file has 768 records:

    data = ReadList["https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", Record, 768];

The columns are named as follows:

    colNames = {"pregnant", "glucose", "bp", "skin", "insulin", "bmi", "pedigree", "age", "label"};

I would like to be able to easily partition that into something that can be fed directly into Classify, for example:

    mytrain = {
      {6, 148, 72, 35, 0, 33.6, 0.627, 50} -> 1,
      {1, 85, 66, 29, 0, 26.6, 0.351, 31} -> 0,
      {8, 183, 64, 0, 0, 23.3, 0.672, 32} -> 1,
      {1, 89, 66, 23, 94, 28.1, 0.167, 21} -> 0,
      {0, 137, 40, 35, 168, 43.1, 2.288, 33} -> 1,
      {5, 116, 74, 0, 0, 25.6, 0.201, 30} -> 0,
      {3, 78, 50, 32, 88, 31.0, 0.248, 26} -> 1,
      {10, 115, 0, 0, 0, 35.3, 0.134, 29} -> 0,
      {2, 197, 70, 45, 543, 30.5, 0.158, 53} -> 1,
      {8, 125, 96, 0, 0, 0.0, 0.232, 54} -> 1};

Is there a simple, generic way to tell Mathematica to partition these 768 records and put them in the form shown above, {{...} -> label}? That is, I would like to split the data into a training set and a testing set, choosing which columns to use, which column is the label, and how many items go into each set, all in the format Mathematica needs.

I tried all of the standard commands, but I must be missing something fundamental about the representation of data.

For what it is worth, I am trying to duplicate this example: http://www.ritchieng.com/machine-learning-evaluate-classification-model including all of the classifier results, confusion matrix, metrics and ROC curves.

Thank you for any insights.
14 days ago
6 Replies
Hakan Kjellerstrand (6 Votes)

To answer your first question: using Import is much easier than ReadList. Import is quite intelligent in parsing the data. For the rest, I would do something like this.

    (* Import the data *)
    dataImport = Import["https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"];

    (* Create the dataset *)
    colNames = {"pregnant", "glucose", "bp", "skin", "insulin", "bmi", "pedigree", "age", "label"};
    dataAll = Thread[colNames -> #] & /@ dataImport // Map[Association] // Dataset;
    dataSelected = dataAll[All, {"pregnant", "insulin", "bmi", "age", "label"}] // Normal;
    data = Thread[Most[#] -> Last[#]] & /@ dataSelected

    (* Check the distribution of 0s and 1s to get a baseline:
       guessing 0 for all examples is 65% correct. *)
    Histogram@data[[All, 2]]
    Counts@data[[All, 2]]
    % / Total@% // N

    (* Split into train and test data *)
    {dataTrain, dataTest} = TakeDrop[RandomSample[data], Round[Length[data]*0.9]];

And then train and test on the dataset:

    (* Here we let Classify run for 60 s *)
    AbsoluteTiming[cl = Classify[dataTrain, PerformanceGoal -> "Quality", TimeGoal -> 60]]

    (* Check the unseen data in the test set *)
    cm = ClassifierMeasurements[cl, dataTest, PerformanceGoal -> "Quality"]
    cm["ConfusionMatrixPlot"]
    cm["Accuracy"]
    cm["AreaUnderROCCurve"]

Note: ClassifierMeasurements offers a lot of statistics, probably all the ones you need (and many more). See the documentation for the complete list.

One thing I miss in Mathematica is N-fold cross validation (i.e. testing N different folds to get the average accuracy over these folds). So here's a simple (and slow) version that uses random folding:

    ClearAll[crossValidation, crossValidation1]

    (* One round: random split, train, return the accuracy on the held-out part *)
    crossValidation1[data_, folds_: 10, time_: Automatic, performance_: Automatic] :=
     Module[{len = Length[data], train, test, cl, cm},
      {train, test} = TakeDrop[RandomSample[data], len - Round[len/folds]];
      cl = Classify[train, TrainingProgressReporting -> None, TimeGoal -> time, PerformanceGoal -> performance];
      cm = ClassifierMeasurements[cl, test];
      cm["Accuracy"]
     ]

    (* Repeat the round folds times and return the mean accuracy *)
    crossValidation[data_, folds_: 10, time_: Automatic, performance_: Automatic] :=
     Module[{accuracy, cv},
      accuracy = Monitor[
        Table[cv = crossValidation1[data, folds, time, performance], {i, 1, folds}],
        {i, cv}];
      Mean[accuracy]
     ]

An example of how to use this:

    (* Cross validation, 10 random folds, 20 s time limit, and go for quality *)
    crossValidation[data, 10, 20, "Quality"]
11 days ago
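The random folding above draws a fresh random split in every round, so the test sets of different rounds can overlap. As a minimal sketch of the classic variant with disjoint folds, assuming `data` is a list of `features -> label` rules with at least as many examples as folds, one could write something like the following. The names `makeFolds` and `kFoldCV` are hypothetical helpers, not built-in functions.

```mathematica
(* Hypothetical helper: k disjoint folds from a shuffled copy of the data.
   Fold i takes every k-th example starting at position i.
   Assumes Length[data] >= k. *)
makeFolds[data_List, k_Integer /; k >= 2] :=
 With[{shuffled = RandomSample[data]},
  Table[shuffled[[i ;; ;; k]], {i, k}]]

(* Hypothetical helper: hold out each fold once and train on the other k - 1 folds.
   Returns the folds that were used, the per-fold accuracies and their mean.
   Extra options (e.g. TimeGoal, PerformanceGoal) are passed on to Classify. *)
kFoldCV[data_List, k_Integer /; k >= 2, opts : OptionsPattern[]] :=
 Module[{folds = makeFolds[data, k], accuracies},
  accuracies = Table[
    ClassifierMeasurements[
     Classify[Join @@ Delete[folds, i], opts, TrainingProgressReporting -> None],
     folds[[i]], "Accuracy"],
    {i, k}];
  <|"Folds" -> folds, "Accuracies" -> accuracies, "Mean" -> Mean[accuracies]|>]
```

A call such as `res = kFoldCV[data, 10, PerformanceGoal -> "Quality"]` would then give the mean via `res["Mean"]`, and `res["Folds"]` keeps the exact folds so that they can be inspected or reused later.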
Q Q (1 Vote)

@Hakan Kjellerstrand, thank you for the excellent response!

Cross validation is another thing I am learning. How can one store the 10-fold sets that were used, to see if the accuracy is improved using this ensemble technique?

Is there also a way to do this automatically with different classification techniques / algorithms, to see if the results can be improved?

Again, thank you for aiding in my learning process!
10 days ago
Hakan Kjellerstrand (3 Votes)

@Q Q: Here are two simple functions, one for n-fold cross validation and one for checking all methods (including a warning/finding).

1) Returning all the accuracies (and the mean) for the random n-fold cross validation defined in my earlier answer:

    (* Return all accuracies *)
    crossValidationAllAccuracies[data_, folds_: 10, time_: Automatic, performance_: Automatic] :=
     Module[{accuracy, cv},
      accuracy = Monitor[
        Table[cv = crossValidation1[data, folds, time, performance], {i, 1, folds}],
        {i, cv}];
      {Mean[accuracy], accuracy}
     ]

Example:

    crossValidationAllAccuracies[data, 10, Automatic, "Quality"]

which gave this result when I ran it:

    {0.723377, {0.727273, 0.753247, 0.727273, 0.623377, 0.727273, 0.727273, 0.779221, 0.675325, 0.805195, 0.688312}}

2) Testing all methods. One should first note that Classify does check many (all?) methods automatically, so it's better to let Classify run to get the best model. The option ValidationSet can be set to Automatic (or to a specific test dataset), in which case a validation set is used.

Also, a thing I noticed when testing this is that when Classify runs with Method -> Automatic, it seems to set the hyperparameters much better than with an explicit method. Here is an example of this. First we let Classify run with Method -> Automatic (the default):

    cl = Classify[dataTrain, Method -> Automatic, PerformanceGoal -> "Quality"]
    ClassifierMeasurements[cl, dataTest, "Accuracy"]

The method chosen is GradientBoostedTrees, with an accuracy of 0.727273. Then we set Method -> "GradientBoostedTrees" explicitly:

    cl = Classify[dataTrain, Method -> "GradientBoostedTrees", PerformanceGoal -> "Quality"]
    ClassifierMeasurements[cl, dataTest, "Accuracy"]

which gives a (much) lower accuracy of 0.688312. This is a bit surprising. (I think there was an issue about this, either here at Wolfram Community or on the StackOverflow Mathematica group. However, I cannot find it now.)

That said, for demonstration purposes, here is code for explicitly testing all methods, but be aware of the problem mentioned above. Also, to keep it simple, I have not included the cross validation, so it just tests on a single train/test split.

    (* Testing all methods *)
    testAll[trainData_, testData_, time_, performance_] :=
     Module[{methods},
      methods = {"DecisionTree", "GradientBoostedTrees", "LogisticRegression", "Markov",
        "NaiveBayes", "NearestNeighbors", "NeuralNetwork", "PriorBaseline",
        "RandomForest", "SupportVectorMachine", Automatic};
      Association[
       # -> ClassifierMeasurements[
           Classify[trainData, Method -> #, PerformanceGoal -> performance,
            TimeGoal -> time, TrainingProgressReporting -> None],
           testData, "Accuracy"] & /@ methods]
     ]

Example:

    AbsoluteTiming[testAll[dataTrain, dataTest, Automatic, "Quality"]]

Result:

    {173.114, <|"DecisionTree" -> 0.597403, "GradientBoostedTrees" -> 0.688312,
      "LogisticRegression" -> 0.688312, "Markov" -> 0.662338, "NaiveBayes" -> 0.688312,
      "NearestNeighbors" -> 0.623377, "NeuralNetwork" -> 0.623377, "PriorBaseline" -> 0.584416,
      "RandomForest" -> 0.597403, "SupportVectorMachine" -> 0.688312, Automatic -> 0.688312|>}

Again, we see that "GradientBoostedTrees" has this quite low accuracy.
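Because a single train/test split can be noisy, it may help to compare methods on the same set of splits and average the accuracies. The following is only a sketch under stated assumptions: `compareMethods` is a hypothetical helper, `splits` is assumed to be a list of `{train, test}` pairs built beforehand (so the splits are stored and shared by all methods), and the method names are taken from the list above.

```mathematica
(* Hypothetical helper: mean accuracy of each method over the same {train, test} pairs.
   Extra options (e.g. TimeGoal, PerformanceGoal) are passed on to Classify. *)
compareMethods[splits : {{_List, _List} ..}, methods_List, opts : OptionsPattern[]] :=
 AssociationMap[
  Function[method,
   Mean[
    ClassifierMeasurements[
       Classify[#[[1]], Method -> method, opts, TrainingProgressReporting -> None],
       #[[2]], "Accuracy"] & /@ splits]],
  methods]

(* Usage sketch: five stored random 90/10 splits, reused for every method.
   Assumes data is the list of rules built earlier in the thread. *)
splits = Table[TakeDrop[RandomSample[data], Round[0.9 Length[data]]], {5}];
compareMethods[splits, {"LogisticRegression", "RandomForest", "NearestNeighbors"},
 PerformanceGoal -> "Quality"]
```

Since every method sees identical training and test sets, differences in the resulting means are less likely to be an artifact of one lucky or unlucky split. The caveat about explicit methods versus Method -> Automatic still applies.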
Vitaliy Kaurov (5 Votes)

@Hakan Kjellerstrand gave an excellent workflow, thank you! I'd just like to point out that this is exactly the place where you should read the docs very carefully, for instance on Classify. You would learn that Classify takes a Dataset directly, and from there (via "See Also" in the docs) you would learn about SemanticImport. So the general low-hassle schema would become something like this.

    data = SemanticImport["https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"]
    data // Length
    (* 768 *)

Now you can feed the resulting Dataset into Classify directly, easily splitting off your training and test data:

    c = Classify[data[;; 500] -> 9]

This means: take the first 500 rows and, from those 500 rows, use the first 8 columns as features and the 9th column as the label. See the result at work on some new data:

    data[501]
    c[data[501, Most]]
    (* 0 *)

And the same goes for ClassifierMeasurements, here on the remaining 268 rows:

    cm = ClassifierMeasurements[c, data[-(768 - 500) ;;] -> 9]
    cm["ConfusionMatrixPlot"]
    cm["AccuracyRejectionPlot"]
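For the generic form asked about in the original question (choose the feature columns, the label column, and the training-set size), a small sketch could look like this. The names `toExamples` and `trainTestSplit` are hypothetical helpers, and the rows are assumed to be plain lists of numbers, such as those returned by Import in the first answer.

```mathematica
(* Hypothetical helper: turn each row into {selected features} -> label *)
toExamples[rows : {__List}, featureCols : {__Integer}, labelCol_Integer] :=
 (#[[featureCols]] -> #[[labelCol]]) & /@ rows

(* Hypothetical helper: shuffle, then put the first nTrain examples in the training set *)
trainTestSplit[examples_List, nTrain_Integer?NonNegative] :=
 TakeDrop[RandomSample[examples], nTrain]

(* Two rows copied from the question, with the label in column 9 *)
rows = {{6, 148, 72, 35, 0, 33.6, 0.627, 50, 1},
        {1, 85, 66, 29, 0, 26.6, 0.351, 31, 0}};
toExamples[rows, Range[8], 9]
(* {{6, 148, 72, 35, 0, 33.6, 0.627, 50} -> 1, {1, 85, 66, 29, 0, 26.6, 0.351, 31} -> 0} *)
```

With the full table one might then write `{train, test} = trainTestSplit[toExamples[dataImport, Range[8], 9], 500]`, where `dataImport` is the imported list of rows from the first answer, and pass `train` to Classify and `test` to ClassifierMeasurements.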