Random selection of elements for getting a high correlation

Hi,

How do I randomly select 12 elements of data for the training list to obtain the highest correlation coefficient for the testing list (CCtest)?

data = {{0.`, 0.048}, {0.2`, 0.424}, {0.4`, 0.943}, {0.60, 1.177}, {0.8`, 
    1.475}, {1.`, 1.839}, {1.200, 2.134}, {1.400, 2.395}, {1.6`, 
    2.564}, {1.8`, 2.814}, {2.`, 2.981}, {2.2`, 2.972}, {2.400, 3.133}, {2.6`,
     2.99}, {2.80, 3.190}, {3.`, 3.184}};

train = Take[data, 12]

{{0., 0.048}, {0.2, 0.424}, {0.4, 0.943}, {0.6, 1.177}, {0.8, 1.475}, {1., 
  1.839}, {1.2, 2.134}, {1.4, 2.395}, {1.6, 2.564}, {1.8, 2.814}, {2., 
  2.981}, {2.2, 2.972}}

lm = LinearModelFit[train, x, x];

gg[x_, y_] := 0.308 + 1.36 x  (* coefficients rounded from the fit lm; the y argument is unused *)

YPtrain = Map[gg[#[[1]], #[[2]]] &, train]

{0.308, 0.58, 0.852, 1.124, 1.396, 1.668, 1.94, 2.212, 2.484, 2.756, 3.028, 3.3}

Ytrain = train[[All, 2]]

{0.048, 0.424, 0.943, 1.177, 1.475, 1.839, 2.134, 2.395, 2.564, 2.814, 2.981, 2.972}

CCtrain = Correlation[YPtrain, Ytrain]

0.985044
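The coefficients typed into gg appear to be rounded values of the fit stored in lm. A minimal sketch, assuming the same data and training list as above, that reads the parameters from the fitted model instead of retyping them (the names a, b and YPtrainFit are introduced here only for illustration):

```wolfram
data = {{0., 0.048}, {0.2, 0.424}, {0.4, 0.943}, {0.6, 1.177}, {0.8, 1.475},
   {1., 1.839}, {1.2, 2.134}, {1.4, 2.395}, {1.6, 2.564}, {1.8, 2.814},
   {2., 2.981}, {2.2, 2.972}, {2.4, 3.133}, {2.6, 2.99}, {2.8, 3.19},
   {3., 3.184}};

train = Take[data, 12];
lm = LinearModelFit[train, x, x];

(* intercept a and slope b taken from the fitted model (illustrative names) *)
{a, b} = lm["BestFitParameters"];

(* a FittedModel can be applied directly to x values *)
YPtrainFit = lm /@ train[[All, 1]];

Correlation[YPtrainFit, train[[All, 2]]]
```

Because the correlation coefficient does not change when the predictions are rescaled by a positive factor or shifted, this should agree with the CCtrain value above even though the coefficients are no longer rounded.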

Evaluation of 0.308 + 1.36 x for the testing data:

test = Take[data, -4]

{{2.4, 3.133}, {2.6, 2.99}, {2.8, 3.19}, {3., 3.184}}

YPtest = Map[gg[#[[1]], #[[2]]] &, test]

{3.572, 3.844, 4.116, 4.388}

Ytest = test[[All, 2]]

{3.133, 2.99, 3.19, 3.184}

CCtest = Correlation[YPtest, Ytest]

0.489591
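One point that may be worth checking before searching over training lists: the Pearson correlation is unchanged by any increasing linear transformation, so for a fitted line with positive slope, Correlation[YPtest, Ytest] equals the correlation between the test x-values and Ytest. A small sketch of that check, using the test list above (the second line's coefficients are made up for illustration, and xTest, yTest, cc1, cc2, cc3 are names introduced here):

```wolfram
test = {{2.4, 3.133}, {2.6, 2.99}, {2.8, 3.19}, {3., 3.184}};
xTest = test[[All, 1]];
yTest = test[[All, 2]];

(* predictions from the line used in the post *)
cc1 = Correlation[0.308 + 1.36 xTest, yTest];

(* predictions from a very different increasing line (made-up coefficients) *)
cc2 = Correlation[-5. + 0.01 xTest, yTest];

(* correlation of the raw x values with y, no model at all *)
cc3 = Correlation[xTest, yTest];

{cc1, cc2, cc3}

(* an error measure, by contrast, does depend on the coefficients *)
rmse = Sqrt[Mean[(0.308 + 1.36 xTest - yTest)^2]]
```

All three correlations should match the 0.489591 reported above, up to rounding. If so, CCtest depends only on which points end up in the testing list (and on the sign of the slope), not on the fitted coefficients, so an error measure such as the RMSE may be a more informative way to compare fits.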
POSTED BY: M.A. Ghorbani
7 Replies

See if this is the goal:

data = {{0.`, 0.048}, {0.2`, 0.424}, {0.4`, 0.943}, {0.60, 
    1.177}, {0.8`, 1.475}, {1.`, 1.839}, {1.200, 2.134}, {1.400, 
    2.395}, {1.6`, 2.564}, {1.8`, 2.814}, {2.`, 2.981}, {2.2`, 
    2.972}, {2.400, 3.133}, {2.6`, 2.99}, {2.80, 3.190}, {3.`, 3.184}};

corr[t1_, t2_] := 
 Module[{train, lm, test, cx, gg, YPtrain, Ytrain, CCtrain, YPtest, 
   Ytest, CCtest, x}, train = SortBy[RandomSample[data, t1], First]; 
  lm = LinearModelFit[train, x, x]; 
  cx = CoefficientList[Normal@lm, x]; 
  gg[x_, y_] := (cx[[1]] + cx[[2]]*x); 
  YPtrain = Map[gg[#[[1]], #[[2]]] &, train]; 
  Ytrain = train[[All, 2]]; CCtrain = Correlation[YPtrain, Ytrain]; 
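  test = SortBy[RandomSample[Complement[data, train], t2], First]; (* sample the test list from the points not used for training, not from train itself *)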
  YPtest = Map[gg[#[[1]], #[[2]]] &, test]; Ytest = test[[All, 2]]; 
  CCtest = Correlation[YPtest, Ytest]; 
  Do[Print[{{Style[ToString[cx[[1]] + cx[[2]]*"x"], 
        Purple]}, {{"YPtrain", YPtrain}, {"Ytrain", 
        Ytrain}, {Style["CCtrain", Blue], 
        Style[CCtrain, Blue]}}, {{"YPtest", YPtest}, {"Ytest", 
        Ytest}, {Style["CCtest", Red], Style[CCtest, Red]}}, 
      ListLinePlot[{YPtrain, Ytrain}, 
       PlotLegends -> {"YPTrain", "Ytrain"}], 
      ListLinePlot[{YPtest, Ytest}, 
       PlotLegends -> {"YPtest", "Ytest"}]}[[z]]], {z, 1, 5}]]

So:

corr[12, 4]

[Image im1: output of corr[12, 4]]

And to test many times you can use, for example, Table[]:

n = 5; Table[corr[12, 4], n]

Did I understand what you want to do? See if this helps.

POSTED BY: Claudio Chaib
Posted 4 years ago

That is a great solution, Chaib. Congratulations!

For example, if we assume n=20, can the code identify the best training and testing lists?

POSTED BY: Alex Teymouri

Yes, it can be done like this:

(In this case, I maximize "CCtest".)

corrN[t1_, t2_, n_] := 
 Module[{train, lm, test, cx, gg, YPtrain, Ytrain, CCtrain, YPtest, 
   Ytest, CCtest, x}, 
  MaximalBy[
   Table[train = SortBy[RandomSample[data, t1], First]; 
    lm = LinearModelFit[train, x, x]; 
    cx = CoefficientList[Normal@lm, x]; 
    gg[x_, y_] := (cx[[1]] + cx[[2]]*x); 
    YPtrain = Map[gg[#[[1]], #[[2]]] &, train]; 
    Ytrain = train[[All, 2]]; CCtrain = Correlation[YPtrain, Ytrain]; 
    test = SortBy[RandomSample[Complement[data, train], t2], First]; (* sample the test list from the points not used for training *)
    YPtest = Map[gg[#[[1]], #[[2]]] &, test]; Ytest = test[[All, 2]]; 
    CCtest = 
     Correlation[YPtest, Ytest]; {Style[CCtest, Red] -> 
      Style["CCtest", Red], 
     Style[CCtrain, Blue] -> Style["CCtrain", Blue], {"YPtrain", 
      YPtrain}, {"Ytrain", Ytrain}, {"YPtest", YPtest}, {"YTest", 
      Ytest}, {Style[ToString[cx[[1]] + cx[[2]]*"x"], Purple]}, 
     ListLinePlot[{YPtrain, Ytrain}, 
      PlotLegends -> {"YPTrain", "Ytrain"}, ImageSize -> Medium], 
     ListLinePlot[{YPtest, Ytest}, PlotLegends -> {"YPtest", "Ytest"},
       ImageSize -> Medium]}, n], First]]

With n=20:

corrN[12, 4, 20]

[Image im2: output of corrN[12, 4, 20]]

And below is a way to maximize "CCtrain" and "CCtest" together, by maximizing their product:

corrAll[t1_, t2_, n_] := 
 Module[{ff, train, lm, test, cx, gg, YPtrain, Ytrain, CCtrain, 
   YPtest, Ytest, CCtest, x, vc}, 
  vc = Table[train = SortBy[RandomSample[data, t1], First]; 
    lm = LinearModelFit[train, x, x]; 
    cx = CoefficientList[Normal@lm, x]; 
    gg[x_, y_] := (cx[[1]] + cx[[2]]*x); 
    YPtrain = Map[gg[#[[1]], #[[2]]] &, train]; 
    Ytrain = train[[All, 2]]; CCtrain = Correlation[YPtrain, Ytrain]; 
    test = SortBy[RandomSample[Complement[data, train], t2], First]; (* sample the test list from the points not used for training *)
    YPtest = Map[gg[#[[1]], #[[2]]] &, test]; Ytest = test[[All, 2]]; 
    CCtest = Correlation[YPtest, Ytest]; 
    ff = {CCtest, 
      CCtrain, {"YPtrain", YPtrain}, {"Ytrain", Ytrain}, {"YPtest", 
       YPtest}, {"YTest", 
       Ytest}, {Style[ToString[cx[[1]] + cx[[2]]*"x"], Purple]}, 
      ListLinePlot[{YPtrain, Ytrain}, 
       PlotLegends -> {"YPTrain", "Ytrain"}, ImageSize -> Medium], 
      ListLinePlot[{YPtest, Ytest}, 
       PlotLegends -> {"YPtest", "Ytest"}, ImageSize -> Medium]}; {ff,
      ff[[1]]*ff[[2]]}, n]; 
  MaximalBy[vc, Last][[1, 1]]]

With n=100:

corrAll[12, 4, 100]

[Image im3: output of corrAll[12, 4, 100]]
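With only 16 data points, random sampling is not the only option: there are Binomial[16, 12] = 1820 ways to pick the training list, so every split can be checked. A minimal sketch, assuming the testing list is always the 4 points left out of training (evalSplit, all and best are hypothetical names, not part of the functions above):

```wolfram
data = {{0., 0.048}, {0.2, 0.424}, {0.4, 0.943}, {0.6, 1.177}, {0.8, 1.475},
   {1., 1.839}, {1.2, 2.134}, {1.4, 2.395}, {1.6, 2.564}, {1.8, 2.814},
   {2., 2.981}, {2.2, 2.972}, {2.4, 3.133}, {2.6, 2.99}, {2.8, 3.19},
   {3., 3.184}};

(* evaluate one split; the test list is the part of data not in train *)
evalSplit[train_] := Module[{lm, test, x},
  lm = LinearModelFit[train, x, x];
  test = Complement[data, train];
  <|"train" -> train, "test" -> test,
   "CCtrain" -> Correlation[lm /@ train[[All, 1]], train[[All, 2]]],
   "CCtest" -> Correlation[lm /@ test[[All, 1]], test[[All, 2]]]|>]

(* all 1820 training lists of 12 points *)
all = evalSplit /@ Subsets[data, {12}];

(* the split with the largest CCtest *)
best = First[MaximalBy[all, #["CCtest"] &]]
```

This replaces the "try n random samples" step with an exhaustive search, which is only practical because the data set is small.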


POSTED BY: Claudio Chaib

Sorry for the late response. I deeply appreciate your efforts and also Alex's suggestion!

Is it possible to avoid overfitting in this model? We should consider only the training and testing lists with CCtrain > CCtest. Am I right? May I have your email address?

POSTED BY: M.A. Ghorbani

Ok, in this case, just do something like this before the maximization (example):

y1 = {{CCtest, CCtrain, "..."}, {CCtest, CCtrain, "..."}, {CCtest, 
    CCtrain, "..."}, {CCtest, CCtrain, "..."}, {CCtest, CCtrain, 
    "..."}};
y2 = Table[
  If[y1[[x, 1]] < y1[[x, 2]], y1[[x]], Nothing], {x, 1, Length@y1}]

MaximalBy[y2, First]
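The same filter can be written with Select. A small runnable sketch, where the result rows are made-up numbers in the form {CCtest, CCtrain, label} and the names results and kept are introduced only for illustration:

```wolfram
(* hypothetical results: {CCtest, CCtrain, label}; the numbers are made up *)
results = {{0.95, 0.99, "run 1"}, {0.99, 0.97, "run 2"}, {0.90, 0.98, "run 3"}};

(* keep only the runs with CCtest < CCtrain *)
kept = Select[results, #[[1]] < #[[2]] &];

(* among those, take the run(s) with the largest CCtest *)
MaximalBy[kept, First]
(* {{0.95, 0.99, "run 1"}} *)
```

Here "run 2" has the largest CCtest overall but is dropped because its CCtest exceeds its CCtrain.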

Yes, you can contact me on LinkedIn (see my profile).

POSTED BY: Claudio Chaib

Hi M.A. Ghorbani,

Do you want to do something like this?

SortBy[RandomSample[data, 12], First]
POSTED BY: Claudio Chaib

Thanks, Claudio. My question is explained more clearly below:

Iteration 1 ->

  • Select 12 elements randomly from the main data as a training list (traininglist1) and 4
    elements from the same main data as a testing list (testinglist1).
  • Fit y=a+bx to traininglist1 and get the parameters a1 and b1, say y=a1+b1.x.
  • With y=a1+b1.x, generate a new list for testinglist1, say YP1.
  • Compute the correlation coefficient between testinglist1 and YP1, say CC1.

Iteration 2 ->

  • Select 12 elements randomly from the main data as a training list (traininglist2) and 4
    elements from the same main data as a testing list (testinglist2).
  • Fit y=a+bx to traininglist2 and get the parameters a2 and b2, say y=a2+b2.x.
  • With y=a2+b2.x, generate a new list for testinglist2, say YP2.
  • Compute the correlation coefficient between testinglist2 and YP2, say CC2.

……………………………..
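The iterations described above could be sketched as follows. This is only one reading of the steps: both lists are drawn independently from the main data, so they may overlap, and the names oneIteration, runs and best are hypothetical.

```wolfram
data = {{0., 0.048}, {0.2, 0.424}, {0.4, 0.943}, {0.6, 1.177}, {0.8, 1.475},
   {1., 1.839}, {1.2, 2.134}, {1.4, 2.395}, {1.6, 2.564}, {1.8, 2.814},
   {2., 2.981}, {2.2, 2.972}, {2.4, 3.133}, {2.6, 2.99}, {2.8, 3.19},
   {3., 3.184}};

(* one iteration: random training list, random testing list, fit, CC *)
oneIteration[nTrain_, nTest_] := Module[{train, test, lm, x, yp},
  train = RandomSample[data, nTrain];
  test = RandomSample[data, nTest];   (* from the same main data; may overlap train *)
  lm = LinearModelFit[train, x, x];   (* y = a + b x *)
  yp = lm /@ test[[All, 1]];          (* YP for the testing list *)
  <|"train" -> train, "test" -> test,
   "fit" -> lm["BestFitParameters"],  (* {a, b} *)
   "CC" -> Correlation[yp, test[[All, 2]]]|>]

SeedRandom[1];  (* only to make the sketch reproducible *)
runs = Table[oneIteration[12, 4], 20];

(* the iteration with the largest CC *)
best = First[MaximalBy[runs, #["CC"] &]]
```

The number of iterations (20 here) is arbitrary; each entry of runs keeps the lists and fitted parameters so the best iteration can be inspected afterwards.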

POSTED BY: M.A. Ghorbani