Automate data file imports to use Classify?

Posted 7 years ago

Hello, I am trying to use the default data files for machine learning. Often, these come as CSV or just data files that are text. For example, this data file has 768 records:

data = ReadList[
   "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data",
   Record, 768];

The column names represent:

colNames = {"pregnant", "glucose", "bp", "skin", "insulin", "bmi", "pedigree", "age", "label"};

I would like to be able to easily partition that into something that can directly be fed into Classify, for example:

mytrain ={{6,148,72,35,0,33.6,0.627,50}->1,{1,85,66,29,0,26.6,0.351,31}->0,{8,183,64,0,0,23.3,0.672,32}->1,{1,89,66,23,94,28.1,0.167,21}->0,{0,137,40,35,168,43.1,2.288,33}->1,{5,116,74,0,0,25.6,0.201,30}->0,{3,78,50,32,88,31.0,0.248,26}->1,{10,115,0,0,0,35.3,0.134,29}->0,{2,197,70,45,543,30.5,0.158,53}->1,{8,125,96,0,0,0.0,0.232,54}->1};

Is there a simple way to tell Mathematica to partition these 768 records into the form shown above ({features} -> label) in a generic way? That is, I would like to split the data into training and testing sets, choosing which columns to use, which column is the label, and how many items go into the training and testing sets, all in the format Classify needs.
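For instance, I imagine something along the lines of this sketch (makeTrainTest and its arguments are just names I made up to illustrate what I am after, assuming the rows have already been parsed into lists of numbers):

(* hypothetical helper: choose the label column, turn each row into features -> label,
   shuffle, and split into training and test sets *)
makeTrainTest[rows_List, labelColumn_Integer, nTrain_Integer] := Module[{pairs},
  pairs = Drop[#, {labelColumn}] -> #[[labelColumn]] & /@ rows;
  TakeDrop[RandomSample[pairs], nTrain]
  ]

(* e.g. label in column 9, 600 records for training, the remaining 168 for testing *)
(* {mytrain, mytest} = makeTrainTest[parsedRows, 9, 600]; *)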

I tried messing with all of the standard commands, but I must be missing some fundamental thing about the representation of data.

For what it is worth, I am trying to duplicate this example:

http://www.ritchieng.com/machine-learning-evaluate-classification-model

including all of the classifier results, confusion matrix, metrics and ROC curves.

Thank you for any insights.

POSTED BY: Q Q

To answer your first question: using Import is much easier than ReadList. Import is quite intelligent about parsing the data.

For the rest, I would do something like this.

(* Import the data *)
dataImport =  Import["https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"];

(* create the dataset *)
colNames = {"pregnant", "glucose", "bp", "skin", "insulin", "bmi", "pedigree", "age", "label"};
dataAll = Thread[colNames -> #] & /@ dataImport // Map[Association] // Dataset;

dataSelected = dataAll[All, {"pregnant", "insulin", "bmi", "age", "label"}] // Normal;
data = Thread[Most[#] -> Last[#]] & /@ dataSelected

(* Check the distribution of 0s and 1s to get a baseline:
   if one guesses 0 for all examples, about 65% are correct. *)
Histogram@data[[All, 2]]
Counts@data[[All, 2]]
% / Total@% // N

(* split into train and test data (90% train, 10% test) *)
{dataTrain, dataTest} = TakeDrop[RandomSample[data], Round[Length[data]*0.9]];

And then train/test the dataset:

(* Here we let Classify run for 60 s *)
AbsoluteTiming[cl = Classify[dataTrain, PerformanceGoal -> "Quality", TimeGoal -> 60]]

(* Check the unseen data in the test set. *) 
cm = ClassifierMeasurements[cl, dataTest, PerformanceGoal -> "Quality"]
cm["ConfusionMatrixPlot"]
cm["Accuracy"]
cm["AreaUnderROCCurve"]

Note: there are a lot of statistics available from ClassifierMeasurements, probably all the ones you need (and many more). See the documentation for the complete list.
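For example, you can ask the measurement object itself what it can report, and request several metrics at once (this reuses the cm from above):

(* list all available properties of the ClassifierMeasurements object *)
cm["Properties"]

(* several metrics in one call; these are reported per class *)
cm[{"Precision", "Recall", "F1Score"}]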

One thing I miss in Mathematica is built-in N-fold cross validation (i.e. testing N different folds to get an average accuracy over the folds). So here is a simple (and slow) version that uses random folding:

ClearAll[crossValidation, crossValidation1]
crossValidation1[data_, folds_: 10, time_: Automatic, performance_: Automatic] := Module[{len = Length[data], train, test, cl, cm},
  {train, test} = TakeDrop[RandomSample[data], len - Round[len/folds]];
  cl = Classify[train, TrainingProgressReporting -> None, TimeGoal -> time, PerformanceGoal -> performance];
  cm = ClassifierMeasurements[cl, test];
  cm["Accuracy"]
  ]
crossValidation[data_, folds_: 10, time_: Automatic, performance_: Automatic] := Module[{accuracy, cv},
  accuracy = Monitor[Table[cv = crossValidation1[data, folds, time, performance], {i, 1, folds}], {i, cv}];
  Mean[accuracy]
  ]

An example of how to use this:

(* Cross validation, 10 random folds, 20 s time limit, and go for quality *)
crossValidation[data, 10, 20, "Quality"]

@Hakan Kjellerstrand gave an excellent workflow, thank you! I'd just like to point out that this is exactly the kind of place where you should read the docs very carefully, for instance on Classify. You would learn that Classify takes a Dataset directly, and from there (via "See Also" in the docs) you would learn about SemanticImport. So the general low-hassle scheme would become something like this.

data = SemanticImport["https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"]

data // Length
(*768*)

Now you can feed the resulting Dataset into Classify directly, easily splitting off your training and test data:

c = Classify[data[;; 500] -> 9]


This means: take the first 500 rows, and from those rows use the first 8 columns as features and the 9th column as the label. See the result at work on some new data:

data[501]


c[data[501, Most]]
(*0*)
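You can also ask for the class probabilities instead of just the decision, using the standard "Probabilities" property of the ClassifierFunction:

(* class probabilities for the same example *)
c[data[501, Most], "Probabilities"]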

And the same goes for ClassifierMeasurements:

cm = ClassifierMeasurements[c, data[-(768 - 500) ;;] -> 9]


cm["ConfusionMatrixPlot"]
cm["AccuracyRejectionPlot"]


POSTED BY: Vitaliy Kaurov

@Q Q: Here are two simple functions: one for n-fold cross validation returning all accuracies, and one for checking all methods (including a warning/finding).

1) Returning all the accuracies (and the mean) for random n-fold cross validation defined in my earlier answer:

(* Return all accuracies *)    
crossValidationAllAccuracies[data_, folds_: 10, time_: Automatic, performance_: Automatic] := Module[{accuracy, cv},
  accuracy = Monitor[ Table[cv = crossValidation1[data, folds, time, performance], {i, 1, folds}], {i, cv}];
  {Mean[accuracy], accuracy}
  ]

Example:

crossValidationAllAccuracies[data, 10, Automatic, "Quality"]

which gave this result when I ran it:

{0.723377, {0.727273, 0.753247, 0.727273, 0.623377, 0.727273, 0.727273, 0.779221, 0.675325, 0.805195, 0.688312}}

2) Testing all methods.

One should first note that Classify does check many (all?) methods automatically, so it's better to let Classify run to get the best model. The option ValidationSet can be set to Automatic (or to a specific dataset), which will then be used as validation data during training.
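As a minimal sketch with the dataTrain/dataTest from above (just to illustrate the option; in practice the validation data should be kept separate from the final test set):

(* let Classify choose methods, and let it split off its own validation set *)
cl = Classify[dataTrain, ValidationSet -> Automatic, PerformanceGoal -> "Quality"]

(* or supply your own held-out data as the validation set *)
cl = Classify[dataTrain, ValidationSet -> dataTest, PerformanceGoal -> "Quality"]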

Also, a thing I noticed when testing this: when Classify runs with Method -> Automatic, it seems to set the hyperparameters much better than with an explicit method. Here is an example of this. First we let Classify run with Method -> Automatic (the default):

cl = Classify[dataTrain, Method -> Automatic, PerformanceGoal -> "Quality"]
ClassifierMeasurements[cl, dataTest, "Accuracy"]

The method chosen is GradientBoostedTrees with an accuracy of 0.727273. Then we set Method->"GradientBoostedTrees" explicitly:

cl = Classify[dataTrain, Method -> "GradientBoostedTrees", PerformanceGoal -> "Quality"]
ClassifierMeasurements[cl, dataTest, "Accuracy"]

which gives a (much) lower accuracy of 0.688312. This is a bit surprising. (I think there was an issue about this, either here at Wolfram Community or on the Mathematica Stack Exchange; however, I cannot find it now.)

That said, for demonstration purposes, here is code that explicitly tests all methods, but be aware of the problem mentioned above. Also, to keep it simple, I have not included cross validation, so it just tests on a single train/test split.

(* Testing all methods. *)  
testAll[trainData_, testData_, time_, performance_] := Module[{methods},
   methods  = {"DecisionTree", "GradientBoostedTrees", "LogisticRegression", "Markov", "NaiveBayes", "NearestNeighbors",  
    "NeuralNetwork", "PriorBaseline", "RandomForest", "SupportVectorMachine", Automatic};
   Association[# -> ClassifierMeasurements[Classify[trainData, Method -> #, PerformanceGoal -> performance, TimeGoal -> time, TrainingProgressReporting -> None], 
    testData, "Accuracy"] & /@ methods]
   ]

Example:

AbsoluteTiming[testAll[dataTrain, dataTest, Automatic, "Quality"]]

Result:

{173.114, <|"DecisionTree" -> 0.597403,   "GradientBoostedTrees" -> 0.688312,   "LogisticRegression" -> 0.688312, 
"Markov" -> 0.662338,   "NaiveBayes" -> 0.688312, "NearestNeighbors" -> 0.623377,   "NeuralNetwork" -> 0.623377, 
"PriorBaseline" -> 0.584416,   "RandomForest" -> 0.597403, "SupportVectorMachine" -> 0.688312,   
Automatic -> 0.688312|>}

Again, we see that "GradientBoostedTrees" shows the same rather low accuracy.

Posted 6 years ago

@Hakan Kjellerstrand, thank you for the excellent response!

Cross validation is another thing I am learning. How can one store the 10 folds that were used, to see whether the accuracy improves when they are combined as an ensemble?
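For example, is it just a matter of returning the classifier and its test fold along with the accuracy, and then combining the stored classifiers by majority vote? Something like this untested sketch adapting crossValidation1 from above is what I have in mind:

(* like crossValidation1, but keep the trained classifier and its test fold *)
crossValidationKeep[data_, folds_: 10] := Module[{len = Length[data], train, test, cl},
  {train, test} = TakeDrop[RandomSample[data], len - Round[len/folds]];
  cl = Classify[train, TrainingProgressReporting -> None];
  <|"Classifier" -> cl, "Test" -> test,
    "Accuracy" -> ClassifierMeasurements[cl, test, "Accuracy"]|>
  ]

(* store all folds, then predict by majority vote over the stored classifiers *)
results = Table[crossValidationKeep[data, 10], {10}];
ensemblePredict[classifiers_, input_] := First[Commonest[#[input] & /@ classifiers]];
(* ensemblePredict[results[[All, "Classifier"]], data[[1, 1]]] *)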

Is there also a way to automatically do this using different classification techniques / algorithms to see if the results can be improved?

Again, thank you for aiding in my learning process!

POSTED BY: Q Q
Posted 6 years ago

@Vitaliy Kaurov, thank you for providing valuable inputs as I only started dabbling in this area very recently.

I really appreciate doing things properly and using the best commands to do so as that is my purpose for using this tool!

It would be so helpful to users if MMA took sample data sets like the one I pointed to and provided very detailed examples of the process, the commands, and the like.

Regards

POSTED BY: Q Q
Posted 6 years ago

@Hakan Kjellerstrand

Just excellent - thank you so much - greatly appreciated!

I have so much to learn and this makes my life easier.

POSTED BY: Q Q