I am not all that familiar with machine learning innards, so I cannot help with the specifics of NetChain. I did run some more basic code and will show the results here. First thing: there is a typographic mismatch between your notebook and the repo directory layout. Very hard on those of us "of an age", in terms of trying to make sense of the error messages. Anyway, use "NDSB2" consistently and all will be better.
That said, the git clone instructions were excellent and very much appreciated, especially by those of us "of an age", for whom figuring out how to properly install software quickly becomes hopeless.
I will pick up from where your code in ndsb2.nb has already set up dat. I use the first 2000 entries for training and the next 1000 for testing. One could of course do many random runs, since there are nearly 12000 entries. So here we proceed.
n = 2000;
traindata = dat[[1 ;; n, 1]];
trainvalues = dat[[1 ;; n, 2]];
m = 1000;
testdata = dat[[n + 1 ;; n + m, 1]];
testvalues = dat[[n + 1 ;; n + m, 2]];
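The split above is a single fixed one. The "many random runs" idea mentioned earlier could be organized as in this sketch (Python rather than Wolfram Language, purely to show the bookkeeping; `random_splits` is a hypothetical helper, and training/evaluation would happen per split):

```python
import random

# Sketch of repeated random train/test splits over ~12000 entries.
# Each run shuffles the index list and takes the first n_train indices
# for training and the next n_test for testing.
def random_splits(num_entries, n_train, n_test, runs, seed=0):
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(num_entries))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:n_train + n_test]

splits = list(random_splits(12000, 2000, 1000, runs=5))
print(len(splits))        # 5 runs
print(len(splits[0][0]))  # 2000 training indices per run
print(len(splits[0][1]))  # 1000 test indices per run
```

Averaging the error summaries over such runs would give a less split-dependent picture than the single fixed split used below.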
I use several methods for the function Predict
, in each case with the setting PerformanceGoal -> "Quality"
. I show the timings both for setting up the predictor function and for running it with each method setting. I also evaluate relative errors between predicted and actual values for the test entries, reported as a list of the form {fraction worse than 20% off, fraction worse than 40% off, fraction within 10%}. So we want values in the range 0 to 1 that we hope are of the form {small, very small, large}. I use the methods {"LinearRegression", "RandomForest", "NearestNeighbors", "NeuralNetwork"}
. There is also a "GaussianProcess" method, but it spewed errors when I tried it.
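Before the runs, here is the error bookkeeping by itself, as a small Python sketch on made-up numbers (not the actual data). It assumes the denominator is meant to be the symmetric RMS of the two values, Sqrt[(p^2 + a^2)/2]:

```python
# Symmetric relative error: |p - a| / sqrt((p^2 + a^2)/2),
# checked here on hypothetical predicted/actual pairs.
predicted = [10.0, 12.0, 30.0, 5.0]
actual    = [11.0, 20.0, 29.0, 9.0]

def rel_error(p, a):
    return abs(p - a) / ((p**2 + a**2) / 2) ** 0.5

errs = [rel_error(p, a) for p, a in zip(predicted, actual)]
m = len(errs)
# {fraction worse than 20% off, worse than 40% off, within 10%}
summary = [sum(e > 0.2 for e in errs) / m,
           sum(e > 0.4 for e in errs) / m,
           sum(e < 0.1 for e in errs) / m]
print(summary)  # [0.5, 0.5, 0.5]
```

The Wolfram Language code below does the same tallying with Select and Length, divided by N[m].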
AbsoluteTiming[
predictorLR =
Predict[traindata -> trainvalues, Method -> "LinearRegression",
PerformanceGoal -> "Quality"];]
(* Out[175]= {70.920143, Null} *)
AbsoluteTiming[newvalsLR = Map[predictorLR, testdata];]
relerrorsLR =
 Abs[(newvalsLR - testvalues)/Sqrt[(newvalsLR^2 + testvalues^2)/2]];
{Length[Select[relerrorsLR, # > .2 &]],
Length[Select[relerrorsLR, # > .4 &]],
Length[Select[relerrorsLR, # < .1 &]]}/N[m]
(* Out[191]= {11.857996, Null}
Out[193]= {0.229, 0.072, 0.501} *)
AbsoluteTiming[
predictorRF =
Predict[traindata -> trainvalues, Method -> "RandomForest",
PerformanceGoal -> "Quality"];]
(* Out[179]= {163.405545, Null} *)
AbsoluteTiming[newvalsRF = Map[predictorRF, testdata];]
relerrorsRF =
 Abs[(newvalsRF - testvalues)/Sqrt[(newvalsRF^2 + testvalues^2)/2]];
{Length[Select[relerrorsRF, # > .2 &]],
Length[Select[relerrorsRF, # > .4 &]],
Length[Select[relerrorsRF, # < .1 &]]}/N[m]
(* Out[194]= {13.969906, Null}
Out[196]= {0.195, 0.077, 0.542} *)
AbsoluteTiming[
predictorNN =
Predict[traindata -> trainvalues, Method -> "NearestNeighbors",
PerformanceGoal -> "Quality"];]
(* Out[183]= {68.806587, Null} *)
AbsoluteTiming[newvalsNN = Map[predictorNN, testdata];]
relerrorsNN =
 Abs[(newvalsNN - testvalues)/Sqrt[(newvalsNN^2 + testvalues^2)/2]];
{Length[Select[relerrorsNN, # > .2 &]],
Length[Select[relerrorsNN, # > .4 &]],
Length[Select[relerrorsNN, # < .1 &]]}/N[m]
(* Out[197]= {12.105393, Null}
Out[199]= {0.371, 0.169, 0.366} *)
AbsoluteTiming[
predictorNN2 =
Predict[traindata -> trainvalues, Method -> "NeuralNetwork",
PerformanceGoal -> "Quality"];]
(* Out[187]= {455.508036, Null} *)
AbsoluteTiming[newvalsNN2 = Map[predictorNN2, testdata];]
relerrorsNN2 =
 Abs[(newvalsNN2 - testvalues)/Sqrt[(newvalsNN2^2 + testvalues^2)/2]];
{Length[Select[relerrorsNN2, # > .2 &]],
Length[Select[relerrorsNN2, # > .4 &]],
Length[Select[relerrorsNN2, # < .1 &]]}/N[m]
(* Out[200]= {22.463613, Null}
Out[202]= {0.231, 0.073, 0.505} *)
Here is a summary of the results.
(1) "NearestNeighbors" gives a poor result relative to the rest; we discard it from further consideration.
(2) The "NeuralNetwork" method is slow.
(3) All three of the better performers have around 7-8% that are worse than 40% off the mark, 20-23% that are worse than 20% off, and 50-54% that come within 10% of correct values.
This strikes me as a reasonable outcome, though again I do not have enough familiarity with either the methods or this data to say whether one might expect to do better.