
Simple Math Problem Shows Massive Flaw in All Machine Learning Algorithms

POSTED BY: David Johnston
3 Replies

Thanks for taking the time to reply. It means a lot. :)

The more I understand about the specifics of most ML, the more I realize why so many people think it has been around for 50 years without any real advancement. I have actually talked to a few practitioners with PhDs, and they all seem disenchanted; they don't share my excited hope or vision for what ML "should" and "could" do.

You can see a Lua script that I modified here. It uses a Darwinian evolutionary algorithm to learn to play Mario Brothers.

http://artificialbrilliance.com/ai-plays-mario-bros-darwinian-neural-net/

That is a good example of a cool use of an EA: the same kind of EA that generates billions on the stock market through high-frequency trading.

The Classify function by itself hasn't shown me that it can find the best algorithm with the highest accuracy. In my experience it is mostly wrong.

Typically I run a series of tests: split the data, train on 70% of it, and verify on the remaining 30%.

Here is my script:

Split Data:

combCat = CloudGet["combCat"];  (* categorical data set *)
dataCount = 50999;(*Get["dataCount"]*)

(* rows 2 through 70% of the data for training *)
training = Part[combCat, 2 ;; Round[dataCount*0.7]];
"training" -> Length[training]
train = RandomSample[training, 1000];  (* work with a 1000-row sample *)
"train" -> {Length[train], Length[Part[train, 1]]}

(* remaining 30% for validation *)
validating = Part[combCat, Round[dataCount*0.7] ;; -1];
"validating" -> Length[validating]
validate = RandomSample[validating, 1000];
"validate" -> {Length[validate], Length[Part[validate, 1]]}

Try different training methods:

cGeneral = Classify[train -> 1]

cNaiveBayes = Classify[train -> 1, {Method -> "NaiveBayes", PerformanceGoal -> "Quality"}];

cNearestNeighbors = Classify[train -> 1, {Method -> "NearestNeighbors", PerformanceGoal -> "Quality"}];

cLogisticRegression = Classify[train -> 1, {Method -> "LogisticRegression", PerformanceGoal -> "Quality"}];

cMarkov = Classify[train -> 1, Method -> "Markov"];

cNeuralNetwork = Classify[train -> 1, Method -> "NeuralNetwork"];

Then, I verify their accuracy scores:

"cGeneral" -> ClassifierMeasurements[cGeneral, validate -> 1, "Accuracy"]

"cNaiveBayes" -> ClassifierMeasurements[cNaiveBayes, validate -> 1, "Accuracy"]

"cNearestNeighbors" -> ClassifierMeasurements[cNearestNeighbors, validate -> 1, "Accuracy"]

"cLogisticRegression" ->  ClassifierMeasurements[cLogisticRegression, validate -> 1, "Accuracy"]

"cMarkov" -> ClassifierMeasurements[cMarkov, validate -> 1, "Accuracy"]

"cNeuralNetwork" -> ClassifierMeasurements[cNeuralNetwork, validate -> 1, "Accuracy"]

This works well for figuring out which algos are best on 1000 samples. Once I get through this, I usually go back and run 10,000 samples on the top 2 performing algos. The winner becomes the one I use.
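As an aside, the whole comparison above can be collapsed into a short loop. This is just a sketch reusing the train and validate variables from above; swap in whatever method names you want to race against each other:

(* train one classifier per method and list validation accuracies, best first *)
methods = {"NaiveBayes", "NearestNeighbors", "LogisticRegression", "Markov", "NeuralNetwork"};
classifiers = AssociationMap[Classify[train -> 1, Method -> #, PerformanceGoal -> "Quality"] &, methods];
accuracies = ClassifierMeasurements[#, validate -> 1, "Accuracy"] & /@ classifiers;
Reverse[Sort[accuracies]]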

However, this is only really helpful when the item you are classifying is a string; it gets really poor scores if the value you want to predict is numerical. For that we have to use Predict instead of Classify. There is a problem, though: Predict doesn't accept the list -> 1 shorthand to tell it which part of each list is the answer, so we have to reformat the lists into input -> output rules.

combNumb = numbCat;  (* numerical data set *)

(* first 70% of rows for training, reformatted as features -> value rules *)
pTraining = Part[combNumb, 2 ;; Round[dataCount 0.7]];
"pTraining" -> Length[pTraining]
pTrain = (Drop[#, 1] -> Part[#, 1]) & /@ RandomSample[pTraining, 35000];
"pTrain" -> {Length[pTrain], Length[Part[Part[pTrain, 1], 1]]}

(* remaining 30% for validation, in the same rule format *)
pValidating = Part[combNumb, Round[dataCount 0.7] ;; -1];
"pValidating" -> Length[pValidating]
pValidate = (Drop[#, 1] -> Part[#, 1]) & /@ RandomSample[pValidating, 15000];
"pValidate" -> {Length[pValidate], Length[Part[Part[pValidate, 1], 1]]}

Then we check to see that it is formatted correctly:

RandomChoice[pTrain]
RandomChoice[pValidate]

Then we train:

pGeneral = Predict[pTrain, {ValidationSet -> pValidate, PerformanceGoal -> "Quality"}];
pRandomForest = Predict[pTrain, {ValidationSet -> pValidate, Method -> "RandomForest", PerformanceGoal -> "Quality"}];
pNearestNeighbors = Predict[pTrain, {ValidationSet -> pValidate, Method -> "NearestNeighbors", PerformanceGoal -> "Quality"}];
pLinearRegression = Predict[pTrain, {ValidationSet -> pValidate, Method -> "LinearRegression", PerformanceGoal -> "Quality"}];
pNeuralNetwork = Predict[pTrain, {ValidationSet -> pValidate, Method -> "NeuralNetwork"}];

Then we validate:

"pGeneral" -> PredictorMeasurements[pGeneral, pValidate, "LogLikelihoodRate"]
"pRandomForest" -> PredictorMeasurements[pRandomForest, pValidate, "LogLikelihoodRate"]
"pNearestNeighbors" -> PredictorMeasurements[pNearestNeighbors, pValidate, "LogLikelihoodRate"]
"pLogisticRegression" -> PredictorMeasurements[pLogisticRegression, pValidate, "LogLikelihoodRate"]
"pNeuralNetwork" -> PredictorMeasurements[pNeuralNetwork, pValidate, "LogLikelihoodRate"]

I played with all the different scores, and none seemed like the best way to judge Predict. I wish there were an "Accuracy" measure like the one in Classify.
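One workaround I've been playing with (just a sketch; pseudoAccuracy is a name I made up, and the 10% tolerance is arbitrary) is to compute an accuracy-like number myself: the fraction of predictions that land within some relative tolerance of the true value:

(* made-up "accuracy" for a regressor: fraction of predictions within tol of the true value;
   uses a relative tolerance, so it assumes the targets are nonzero *)
pseudoAccuracy[predictor_, testData_, tol_: 0.1] :=
 Module[{pred = predictor[Keys[testData]], actual = Values[testData]},
  N@Mean[Boole[MapThread[Abs[#1 - #2] <= tol*Abs[#2] &, {pred, actual}]]]]

pseudoAccuracy[pRandomForest, pValidate]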

I had always imagined that Classify and Predict by themselves were doing something similar to this when you don't specify a method. Maybe they do it on a smaller sample; I am not sure. However, I have never seen either get better scores than this method.

I am ignorant of many of the specifics of these algorithms and how they came about. However, if I can code this, it could be a built-in function.

Both Classify and Predict could be more consistent with each other and have simple extra parameters.

Example: cGeneral = Classify[train -> 1, {Method -> "TryAll", PerformanceGoal -> "Quality", TryTranspose -> "True", FeatureSubset -> "True" -> 50%, FeaturePermutation -> "True" -> 50%, ValidationSet -> validate -> 1}];

The output could look like:

output=
"cRandomForest" -> 0.6191
"cNaiveBayes" -> 0.5096
"cNearestNeighbors" -> 0.35793
"cLogisticRegression" -> 0.434067
"cNeuralNetwork" -> 0.356667

I noticed that PerformanceGoal -> "Quality" kind of does a sort of ensemble boost with RandomForest, but not with the others. What I would really like is a RandomForest-style version of the neural net: let it iterate 100k+ times (if I so chose) and find the best neural net settings for the dataset.

The reason is that RandomForest tends to get the highest scores with data under 1 million rows, while neural nets do better the bigger the data grows. However, I have not found a simple way to build an EA on top of WL neural nets. So I think a second-best option would be to use RandomForest-style boosting/bagging techniques to iterate through thousands of ways to set up the neural net and check all the various settings to come up with the winning solution.
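In the meantime, the bagging half of that idea can be imitated by hand: train several "NeuralNetwork" classifiers on bootstrap resamples of the training data and let them vote. This is just a sketch (baggedNets and baggedClassify are names I made up), not a real hyperparameter search:

(* crude bagging: k neural-net classifiers, each trained on a bootstrap resample *)
baggedNets[trainData_, k_: 5] :=
 Table[Classify[RandomChoice[trainData, Length[trainData]] -> 1, Method -> "NeuralNetwork"], {k}]

(* majority vote over the ensemble for a single example (features only, label dropped) *)
baggedClassify[nets_, example_] := First[Commonest[Through[nets[example]]]]

nets = baggedNets[train, 5];
baggedClassify[nets, Rest[First[validate]]]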

POSTED BY: David Johnston

Google's response is spot on. Fundamentally, it cannot return "10" if "10" isn't even a category that it was trained with. Classify is basically a set of supervised machine learning algorithms.

Looking at your input data as a human being, I would (understanding the input to mean what it means to Classify) return 1 as well.

If you want to understand what Classify is doing, you can see which methods it chose using the ClassifierInformation function. If you do, you will see that Classify is using a series of Markov chains, and that's not going to do what you want.
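For example (with c standing for your trained ClassifierFunction), something along these lines shows the summary and, if I'm remembering the property name right, the method it picked:

ClassifierInformation[c]            (* summary panel for the trained classifier *)
ClassifierInformation[c, "Method"]  (* which method Classify chose automatically *)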

You are asking for an algorithm that can robustly handle recognizing sequences with "anything related to permutations, shifting, changing, organizing, reordering, etc. of the data". I would encourage you to take a look at different machine learning algorithms, both supervised and unsupervised, and see what kinds of properties they have. What you are asking for here is not reasonable to expect from a generic ML algorithm.

If you've discovered a "Massive Flaw in All Machine Learning Algorithms", it is that they are less magical and intelligent than they often appear at first. Classify is by far the easiest-to-use ML tool I've ever seen, but if you want to push it beyond its conventional uses, you must understand how ML algorithms work and what their limitations are.

POSTED BY: Sean Clarke

Found something that works. Not sure whether it can be applied in other, more general situations yet. Basically, if the desired label appears anywhere in the data samples, it can find it: you just create every possible combination.

So far it only works with Classify. Using the numerical data set and Predict, it never gets the right answer.

subs = Union[Select[ArrayFlatten[Subsets[#] & /@ Transpose[catData], 1], Length[#] > 0 &]];

c = Classify[subs -> 1, PerformanceGoal -> "Quality"]

c[{"1st", "2nd", "3rd", "4th", "5th", "6th", "7th"}]

Out="10th"

I am sure this would drastically slow down most processes. With this tiny data set, it ballooned up to 2,240 samples. I suppose it could be trimmed down by changing the Select length to something like 50% of the average sample length.

subs = Union[Select[ArrayFlatten[Subsets[#] & /@ catData, 1], Length[#] > Total[Length[#] & /@ catData]/Length[catData]*.5 &]];

This trims it down to 598 variations, which is still almost a 600% bloat.

I did notice it fails if your length Select gets close to the same length as the samples. The average length is 8, but if you limit Select to 6, 7, or 8 it will fail to find the right answer.

This doesn't really accomplish my goal of 3D predictions. If there are patterns in the columns, there should be a way to include that in an algorithm. Limiting the algo to rows only may be good for some things, but it's not good for what I believe is called semi-supervised learning.

In a way, using Transpose and Subsets is a shortcut to getting the right answer. I wonder if there is a more mathematically sound way to do it that could be applied generally to any sequential data set.

POSTED BY: David Johnston