It turns out I can improve somewhat on the values shown in my response using Predict. For this I use a variant of a method presented at SYNASC 2016, "Linking Fourier and PCA Methods for Image Look-up".
http://synasc.ro/2016/list-of-papers/
The paper should appear in a few months; for now I will defer to it for detailed explanations. But here is the rough idea.
(1) For each data item, take a discrete cosine transform. Keep only some of the low-order components.
(2) Flatten the retained DCT coefficients so each data item is now a vector (a small illustration of these two steps appears after this list).
(3) Use the singular value decomposition on the matrix of flattened DCT vectors that comes from the training set of data items. Keep only the largest singular values and the corresponding singular vectors.
(4) Use the result to create a lookup function. Apply it to test data elements to find some number of "closest" training vectors to each test element. Use the values of those nearby training elements to guess the test values.
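To make the data shapes concrete, here is a hypothetical illustration of steps (1) and (2). The three-channel 16x16 item is made up for the example; the real data items just need the same list-of-channel-matrices structure.

item = RandomReal[1, {3, 16, 16}]; (* a made-up data item: 3 channels of 16x16 *)
vec = Flatten[Map[FourierDCT[#, 2] &, item][[All, 1 ;; 4, 1 ;; 4]]];
Dimensions[vec]
(* {48}, i.e. 3 channels * 4*4 retained low-order DCT coefficients *)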
The actual code is fairly simple and quite fast.
nearestImages[dctvecs_, vals_, keep_] :=
 Module[{uu, ww, vv, udotw, norms},
  {uu, ww, vv} = SingularValueDecomposition[dctvecs, keep];
  udotw = uu.ww;
  (* normalize each reduced training vector to unit length *)
  norms = Map[Sqrt[#.#] &, udotw];
  udotw = udotw/norms;
  Clear[uu, ww];
  {Nearest[udotw -> vals(*, Method -> "KDTree"*)], vv}]
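Note that because each row of uu.ww is scaled to unit length (and processInput below does the same to the test vectors), the Euclidean lookup done by Nearest is in effect ranking neighbors by angle, that is, by a cosine-style similarity in the reduced space. The commented-out Method -> "KDTree" is simply an alternative option setting one could try for Nearest.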
processInput[dctvecs_, vv_] :=
 Module[{tdotv, norms},
  (* project the test vectors into the reduced space and normalize them *)
  tdotv = dctvecs.vv;
  norms = Map[Sqrt[#.#] &, tdotv];
  tdotv = tdotv/norms;
  tdotv]
guesses[nf_, tvecs_, n_] :=
 Module[{nbrs, rng = Range[N@n, 1., -1]^2},
  (* look up the n nearest training values for each test vector, then take a
     weighted average that favors the closer neighbors *)
  nbrs = Map[nf[#, n] &, tvecs];
  Map[rng.# &, nbrs]/Total[rng]]
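As a quick check of the calling pattern, here is a hypothetical smoke test on synthetic vectors; every dimension and value below is invented purely for illustration.

SeedRandom[1234];
toytrain = RandomReal[{-1, 1}, {50, 48}]; (* 50 "flattened DCT" training vectors *)
toyvals = RandomReal[{0, 10}, 50];        (* 50 associated target values *)
toytest = RandomReal[{-1, 1}, {5, 48}];   (* 5 test vectors *)
{toynf, toyvv} = nearestImages[toytrain, toyvals, 4];
guesses[toynf, processInput[toytest, toyvv], 3]
(* With n = 3 the weights are Range[3., 1., -1]^2 = {9., 4., 1.}, so each guess
   is (9 v1 + 4 v2 + v3)/14, where v1, v2, v3 are the values of the three
   nearest training items, closest first. *)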
Since this is fast, at least with the parameter settings I found to work well (based on smaller test runs), I use 10000 images for training.
n = 10000;
traindata = dat[[1 ;; n, 1]];
trainvalues = dat[[1 ;; n, 2]];
m = 1000;
testdata = dat[[n + 1 ;; n + m, 1]];
testvalues = dat[[n + 1 ;; n + m, 2]];
keep = 4; (* number of singular values/vectors retained *)
dn = 4;   (* size of the retained low-order DCT block *)
dct = 2;  (* FourierDCT type *)
AbsoluteTiming[
traindctvecs =
Table[Flatten[
Map[FourierDCT[#, dct] &, traindata[[j]]][[All, 1 ;; dn, 1 ;; dn]]],
{j, Length[traindata]}];
{nfunc, vv} =
nearestImages[traindctvecs, trainvalues, keep];]
AbsoluteTiming[
testdctvecs =
Table[Flatten[
Map[FourierDCT[#, dct] &, testdata[[j]]][[All, 1 ;; dn, 1 ;; dn]]],
{j, Length[testdata]}];
testvecs = processInput[testdctvecs, vv];]
newvalsNI = guesses[nfunc, testvecs, 140];
relerrorsNI =
  Abs[(newvalsNI - testvalues)/Sqrt[(newvalsNI^2 + testvalues^2)/2]];
{Length[Select[relerrorsNI, # > .2 &]],
Length[Select[relerrorsNI, # > .4 &]],
Length[Select[relerrorsNI, # < .1 &]]}/N[m]
(* Out[403]= {29.446372, Null}
Out[404]= {1.980268, Null}
Out[407]= {0.11, 0.031, 0.681} *)
So now we have 11% that are off by more than 20% of the correct values, 68% within a tenth of the correct values, and only 3% with errors in excess of 40% of the corresponding reference values. And it was several times faster than any of the Predict methods, when those were trained on 2000 data values. Not bad for the end of the working day.
Since the speed is reasonable, one could of course do a number of randomized tests using RandomSample to separate the input into training vs test data. That could help to give some idea of the variance of this method, that is to say, how far it might stray from the values shown above.
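For instance, a randomized trial along those lines might look like the sketch below. The helper name randomTrial and its argument list are hypothetical; it assumes dat has the same {image data, value} structure used above and reuses the functions already defined.

randomTrial[dat_, ntrain_, ntest_, keep_, dn_, dct_, nnbrs_] :=
 Module[{perm, train, test, trainvecs, testvecs, nf, vv, vals},
  (* choose a random train/test split *)
  perm = RandomSample[Range[Length[dat]], ntrain + ntest];
  train = dat[[perm[[1 ;; ntrain]]]];
  test = dat[[perm[[ntrain + 1 ;;]]]];
  trainvecs = Table[Flatten[
     Map[FourierDCT[#, dct] &, train[[j, 1]]][[All, 1 ;; dn, 1 ;; dn]]],
    {j, ntrain}];
  testvecs = Table[Flatten[
     Map[FourierDCT[#, dct] &, test[[j, 1]]][[All, 1 ;; dn, 1 ;; dn]]],
    {j, ntest}];
  {nf, vv} = nearestImages[trainvecs, train[[All, 2]], keep];
  vals = guesses[nf, processInput[testvecs, vv], nnbrs];
  (* return the relative errors for this particular split *)
  Abs[(vals - test[[All, 2]])/Sqrt[(vals^2 + test[[All, 2]]^2)/2]]]

(* e.g. five trials with the settings used above *)
trialerrors = Table[randomTrial[dat, 10000, 1000, 4, 4, 2, 140], {5}];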