Hello, I am trying to use standard public datasets for machine learning. These often come as CSV or other plain-text data files. For example, this data file has 768 records:
data = ReadList[
  "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data",
  Record, 768];
The columns represent:
colNames = {"pregnant", "glucose", "bp", "skin", "insulin", "bmi", "pedigree", "age", "label"};
I would like to be able to easily partition that into something that can directly be fed into Classify, for example:
mytrain ={{6,148,72,35,0,33.6,0.627,50}->1,{1,85,66,29,0,26.6,0.351,31}->0,{8,183,64,0,0,23.3,0.672,32}->1,{1,89,66,23,94,28.1,0.167,21}->0,{0,137,40,35,168,43.1,2.288,33}->1,{5,116,74,0,0,25.6,0.201,30}->0,{3,78,50,32,88,31.0,0.248,26}->1,{10,115,0,0,0,35.3,0.134,29}->0,{2,197,70,45,543,30.5,0.158,53}->1,{8,125,96,0,0,0.0,0.232,54}->1};
Is there a simple, generic way to tell Mathematica to split these 768 records into a training set and a testing set, each in the {{...} -> label} form shown above? Ideally I would specify which columns to use as features, which column is the label, and how many records go into each set.
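To make the request concrete, here is a minimal sketch of the kind of helper I have in mind. The name splitForClassify and its argument order are my own invention, not a built-in, and it assumes the records have already been parsed into a numeric matrix with one row per record (for instance with Import[..., "CSV"], which I have not verified on this file). The sample rows are the first four records listed above, with the label appended as the ninth column.

```mathematica
(* Hypothetical helper, not a built-in function.
   rows        : numeric matrix, one record per row (assumed already parsed)
   featureCols : list of column positions to use as features
   labelCol    : position of the label column
   nTrain      : number of examples to put in the training set *)
splitForClassify[rows_?MatrixQ, featureCols_List, labelCol_Integer, nTrain_Integer] :=
  Module[{rules, shuffled},
    (* Turn every row into {features...} -> label *)
    rules = Thread[rows[[All, featureCols]] -> rows[[All, labelCol]]];
    (* Shuffle so that the split does not depend on the file order *)
    shuffled = RandomSample[rules];
    (* First nTrain examples for training, the remainder for testing *)
    TakeDrop[shuffled, nTrain]
  ];

(* Small illustration: 8 feature columns, label in column 9 *)
sample = {
   {6, 148, 72, 35, 0, 33.6, 0.627, 50, 1},
   {1, 85, 66, 29, 0, 26.6, 0.351, 31, 0},
   {8, 183, 64, 0, 0, 23.3, 0.672, 32, 1},
   {1, 89, 66, 23, 94, 28.1, 0.167, 21, 0}};
{train, test} = splitForClassify[sample, Range[8], 9, 3];
```

On the full 768-row matrix, a call such as splitForClassify[rows, Range[8], 9, 600] would leave 168 records for testing, and train could then presumably be passed straight to Classify.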
I have tried the standard commands, but I must be missing something fundamental about how the data is represented.
For what it is worth, I am trying to duplicate this example:
http://www.ritchieng.com/machine-learning-evaluate-classification-model
including all of the classifier results, confusion matrix, metrics and ROC curves.
Thank you for any insights.