Thanks for taking the time to reply. It means a lot. :)
The more I understand about the specifics of most ML, the more I see why so many people feel it has been around for 50 years with no real advancement. I have actually talked to a few practitioners with PhDs, and they all seem disenchanted; none of them share my excited hope or vision for what ML "should" and "could" do.
You can see a Lua script that I modified here. It uses a Darwinian evolutionary algorithm to learn to play Mario Bros.
http://artificialbrilliance.com/ai-plays-mario-bros-darwinian-neural-net/
That is a good example of a cool use of an EA: the same kind of EA that generates billions on the stock market through high-frequency trading.
The Classify function by itself hasn't shown me that it can find the algorithm with the highest accuracy; in my experience its automatic choice is usually not the best one.
Typically I run a series of tests: I split the data, train on 70% of it, and verify on the remaining 30%.
Here is my script:
Split Data:
combCat = CloudGet["combCat"];
dataCount = 50999; (*Get["dataCount"]*)
training = Part[combCat, 2 ;; Round[dataCount*0.7]]; (*start at row 2, skipping the first row*)
"training" -> Length[training]
train = RandomSample[training, 1000]; (*small sample for quick comparisons*)
"train" -> {Length[train], Length[Part[train, 1]]}
validating = Part[combCat, Round[dataCount*0.7] ;; -1];
"validating" -> Length[validating]
validate = RandomSample[validating, 1000];
"validate" -> {Length[validate], Length[Part[validate, 1]]}
Try different training methods:
cGeneral = Classify[train -> 1]
cNaiveBayes = Classify[train -> 1, Method -> "NaiveBayes", PerformanceGoal -> "Quality"];
cNearestNeighbors = Classify[train -> 1, Method -> "NearestNeighbors", PerformanceGoal -> "Quality"];
cLogisticRegression = Classify[train -> 1, Method -> "LogisticRegression", PerformanceGoal -> "Quality"];
cMarkov = Classify[train -> 1, Method -> "Markov"];
cNeuralNetwork = Classify[train -> 1, Method -> "NeuralNetwork"];
Then, I verify their accuracy scores:
"cGeneral" -> ClassifierMeasurements[cGeneral, validate -> 1, "Accuracy"]
"cNaiveBayes" -> ClassifierMeasurements[cNaiveBayes, validate -> 1, "Accuracy"]
"cNearestNeighbors" -> ClassifierMeasurements[cNearestNeighbors, validate -> 1, "Accuracy"]
"cLogisticRegression" -> ClassifierMeasurements[cLogisticRegression, validate -> 1, "Accuracy"]
"cMarkov" -> ClassifierMeasurements[cMarkov, validate -> 1, "Accuracy"]
"cNeuralNetwork" -> ClassifierMeasurements[cNeuralNetwork, validate -> 1, "Accuracy"]
This works well for figuring out which algorithms do best on 1,000 samples. Once I get through this, I usually go back and run 10,000 samples on the top two performers; the winner becomes the one I use.
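The follow-up run looks something like this. A sketch only, reusing the training/validating splits from above; "RandomForest" and "NaiveBayes" stand in for whichever two methods scored highest on the 1,000-sample pass:
train10k = RandomSample[training, 10000]; (*bigger samples for the final head-to-head*)
validate10k = RandomSample[validating, 10000];
cTop1 = Classify[train10k -> 1, Method -> "RandomForest", PerformanceGoal -> "Quality"];
cTop2 = Classify[train10k -> 1, Method -> "NaiveBayes", PerformanceGoal -> "Quality"];
"cTop1" -> ClassifierMeasurements[cTop1, validate10k -> 1, "Accuracy"]
"cTop2" -> ClassifierMeasurements[cTop2, validate10k -> 1, "Accuracy"]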
However, this is only really helpful when the thing you are classifying is categorical (a string). Scores get really poor when the value you want to predict is numerical; for that we have to use Predict instead of Classify. There is a catch, though: Predict doesn't accept the list -> 1 shorthand to tell it which part of each list is the answer, so we have to reformat the lists into input -> output rules.
combNumb = numbCat;
pTraining = Part[combNumb, 2 ;; Round[dataCount*0.7]];
"pTraining" -> Length[pTraining]
pTrain = Rest[#] -> First[#] & /@ RandomSample[pTraining, 35000]; (*features -> target rules; the target is column 1*)
"pTrain" -> {Length[pTrain], Length[Part[Part[pTrain, 1], 1]]}
pValidating = Part[combNumb, Round[dataCount*0.7] ;; -1];
"pValidating" -> Length[pValidating]
pValidate = Rest[#] -> First[#] & /@ RandomSample[pValidating, 15000];
"pValidate" -> {Length[pValidate], Length[Part[Part[pValidate, 1], 1]]}
Then we check that it formatted correctly:
RandomChoice[pTrain]
RandomChoice[pValidate]
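If the formatting is right, each of those comes back as a rule of the form features -> value, something like this (numbers made up for illustration):
{0.12, 3.4, 7.8} -> 42.5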
Then we train:
pGeneral = Predict[pTrain, ValidationSet -> pValidate, PerformanceGoal -> "Quality"];
pRandomForest = Predict[pTrain, ValidationSet -> pValidate, Method -> "RandomForest", PerformanceGoal -> "Quality"];
pNearestNeighbors = Predict[pTrain, ValidationSet -> pValidate, Method -> "NearestNeighbors", PerformanceGoal -> "Quality"];
pLinearRegression = Predict[pTrain, ValidationSet -> pValidate, Method -> "LinearRegression", PerformanceGoal -> "Quality"];
pNeuralNetwork = Predict[pTrain, ValidationSet -> pValidate, Method -> "NeuralNetwork"];
Then we validate:
"pGeneral" -> PredictorMeasurements[pGeneral, pValidate, "LogLikelihoodRate"]
"pRandomForest" -> PredictorMeasurements[pRandomForest, pValidate, "LogLikelihoodRate"]
"pNearestNeighbors" -> PredictorMeasurements[pNearestNeighbors, pValidate, "LogLikelihoodRate"]
"pLogisticRegression" -> PredictorMeasurements[pLogisticRegression, pValidate, "LogLikelihoodRate"]
"pNeuralNetwork" -> PredictorMeasurements[pNeuralNetwork, pValidate, "LogLikelihoodRate"]
I played with all the different scores, and none seemed like the obviously right way to judge Predict. I wish there were a single headline number like Classify's "Accuracy".
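For now I just pull a few of the documented measurements side by side. A small sketch (compare is my own throwaway helper; "StandardDeviation" and "RSquared" are standard PredictorMeasurements properties):
measures = {"StandardDeviation", "RSquared", "LogLikelihoodRate"};
compare[p_] := Thread[measures -> (PredictorMeasurements[p, pValidate, #] & /@ measures)]; (*one rule per measure*)
compare[pRandomForest]
compare[pNeuralNetwork]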
I had always imagined that Classify and Predict by themselves were doing something similar to this when you don't specify a method. Maybe they do it on a smaller sample; I am not sure. However, I have never seen them get better scores than this method.
I am ignorant of many of the specifics of the algorithms and how they came about. But if I can code this, it could be a built-in function.
Both Classify and Predict could be more consistent with each other and take a few simple extra parameters.
Example:
cGeneral = Classify[train -> 1, Method -> "TryAll", PerformanceGoal -> "Quality", "TryTranspose" -> True, "FeatureSubset" -> 0.5, "FeaturePermutation" -> 0.5, ValidationSet -> (validate -> 1)];
The output could look like:
output =
 {"cRandomForest" -> 0.6191,
  "cNaiveBayes" -> 0.5096,
  "cNearestNeighbors" -> 0.35793,
  "cLogisticRegression" -> 0.434067,
  "cNeuralNetwork" -> 0.356667}
I noticed that PerformanceGoal -> "Quality" seems to do a sort of ensemble/boosting pass with RandomForest but not with the other methods. What I would really like is a RandomForest-style version of the neural net: let it iterate 100k+ times (if I so choose) and find the best neural net settings for the dataset.
The reason is that RandomForest tends to get the highest scores on data under 1 million rows, while neural nets do better as the data grows. However, I have found no simple way to build an EA on top of WL neural nets. So I think the second-best option would be to use RandomForest-style boosting/bagging techniques to iterate through thousands of neural net configurations, check all the various settings, and come up with the winning setup.
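In the meantime, the closest thing I can build myself is a brute-force random search over net shapes with NetChain/NetTrain. This is only a sketch under assumptions: it is plain random search rather than a true EA, the layer-size ranges and round counts are arbitrary, and it assumes the pTrain/pValidate rules built earlier hold numeric feature vectors:
nFeatures = Length[First[Keys[pTrain]]]; (*length of one feature vector*)
trial[] := Module[{h1, h2, net, trained, preds, rmse},
  {h1, h2} = RandomInteger[{8, 128}, 2]; (*random hidden-layer widths*)
  net = NetChain[{LinearLayer[h1], Ramp, LinearLayer[h2], Ramp, LinearLayer[]},
    "Input" -> nFeatures, "Output" -> "Real"];
  trained = NetTrain[net, pTrain, ValidationSet -> pValidate,
    MaxTrainingRounds -> 20];
  preds = trained[Keys[pValidate]];
  rmse = Sqrt[Mean[(preds - Values[pValidate])^2]]; (*root-mean-square error*)
  {h1, h2} -> rmse];
results = Table[trial[], {25}]; (*25 trials here; scale up as patience allows*)
First[SortBy[results, Last]] (*smallest error wins*)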