Best way to identify the best combination of variables in Predict?

Dear All,

My goal is to find the combination of input variables that gives the highest prediction accuracy. In the code below I evaluated only three combinations of the inputs, but there are many others, such as y=f(x1,x4), y=f(x3,x4,x5), y=f(x3), and so on.

Evaluating every possible combination by hand is time-consuming. Is there a way to find the best combination of inputs automatically?

Any help would be greatly appreciated.

y = {7.56, 3.79, 2.85, 8.47, 1.37, 5.16, 3.83, 6.58, 6.14, 5.82};
x1 = {1.7, 0.67, 0.5, 7.9, 5.5, 0.81, 6.9, 4.6, 8.2, 8.1};
x2 = {9.02, 0.85, 1.09, 3.37, 8.64, 6.72, 0.62, 7.12, 7.42, 2.03};
x3 = {2.69, 2.04, 3.12, 2.09, 0.89, 7.82, 7.56, 2.24, 7.25, 3.44};
x4 = {6.01, 2.73, 5.35, 7.33, 9.38, 9.94, 1.19, 5.05, 9.39, 8.86};
x5 = {0.84, 2.31, 4.42, 4.18, 8.46, 3.02, 9.09, 6.14, 4.10, 7.15};

(*Scenario 1 : y=f(x1) *)

tuples1 = Thread[Rule[Transpose[{x1}], y]];

train = Take[tuples1, 7];

test = Take[tuples1, -3];

cfunc = Predict[train, Method -> "NeuralNetwork", 
   PerformanceGoal -> "Quality"];

predictOnTrained = Map[cfunc, train[[All, 1]]];

predictOnTest = Map[cfunc, test[[All, 1]]];

actualOnTrained = train[[All, 2]];

actualOnTest = test[[All, 2]];

RootMeanSquare[actualOnTest - predictOnTest]

Correlation[actualOnTest, predictOnTest]

(*Scenario 2 : y=f(x1,x2) *)

tuples2 = Thread[Rule[Transpose[{x1, x2}], y]];

train = Take[tuples2, 7];

test = Take[tuples2, -3];

cfunc = Predict[train, Method -> "NeuralNetwork", 
   PerformanceGoal -> "Quality"];

predictOnTrained = Map[cfunc, train[[All, 1]]];

predictOnTest = Map[cfunc, test[[All, 1]]];

actualOnTrained = train[[All, 2]];

actualOnTest = test[[All, 2]];

RootMeanSquare[actualOnTest - predictOnTest]

Correlation[actualOnTest, predictOnTest]

(*Scenario 3 : y=f(x1,x2,x3) *)

tuples3 = Thread[Rule[Transpose[{x1, x2, x3}], y]];

train = Take[tuples3, 7];

test = Take[tuples3, -3];

cfunc = Predict[train, Method -> "NeuralNetwork", 
   PerformanceGoal -> "Quality"];

predictOnTrained = Map[cfunc, train[[All, 1]]];

predictOnTest = Map[cfunc, test[[All, 1]]];

actualOnTrained = train[[All, 2]];

actualOnTest = test[[All, 2]];

RootMeanSquare[actualOnTest - predictOnTest]

Correlation[actualOnTest, predictOnTest]
POSTED BY: M.A. Ghorbani
3 Replies
Posted 2 years ago

Selecting the model with the largest $R^2$, or equivalently the smallest root mean square error, is not good practice when you are comparing many models with different numbers of predictors (especially in all-possible-subsets regression). Why? Adding any predictor, even a purely random one, never decreases the $R^2$ value and never increases the root mean square error on the data used for fitting.
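A short sketch of why this holds for least-squares fits (the symbols $\mathrm{RSS}_p$, $\mathrm{TSS}$, $n$ and $p$ are notation introduced here, not from the original post). For a model with $p$ predictors fitted to $n$ observations:

$$R^2_p = 1 - \frac{\mathrm{RSS}_p}{\mathrm{TSS}}, \qquad \mathrm{RSS}_{p+1} \le \mathrm{RSS}_p \;\Longrightarrow\; R^2_{p+1} \ge R^2_p$$

The inequality follows because the smaller model is the special case of the larger one with the new coefficient set to zero, so the least-squares minimum cannot get worse. A measure that penalizes the extra predictor, such as the adjusted $R^2$, does not have this property:

$$\bar{R}^2_p = 1 - \left(1 - R^2_p\right)\frac{n-1}{n-p-1}$$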

POSTED BY: Jim Baldwin
Posted 2 years ago

I know the text below doesn't answer your question about Predict but does address all-possible subsets regression.

Also, the term "best" is not specific enough. For example, you might want to choose the model with the smallest $AIC_c$ value rather than the largest $R^2$ value. Finally, you might want to consider model averaging, where you don't choose a single model but rather use a weighted average of a set of models.
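For reference, the usual definitions, written with symbols that are assumptions of this sketch: $k$ is the number of estimated parameters (including the error variance), $n$ the number of observations, and $\hat{L}$ the maximized likelihood.

$$AIC = 2k - 2\ln\hat{L}, \qquad AIC_c = AIC + \frac{2k(k+1)}{n-k-1}$$

One common way to do the model averaging is with Akaike weights, where $\Delta_i = AIC_{c,i} - \min_j AIC_{c,j}$ and $\hat{y}_i$ is the prediction of model $i$:

$$w_i = \frac{\exp(-\Delta_i/2)}{\sum_j \exp(-\Delta_j/2)}, \qquad \hat{y} = \sum_i w_i\,\hat{y}_i$$

Note that the correction term requires $n > k + 1$, which matters with only 10 observations.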

If your question was "How do I efficiently find the best linear regression among all possible subsets of predictors without evaluating every subset?", then there are several approaches. (Although, with even a moderate number of predictor variables, this can get out of hand easily.)

One article to look at is "Exact Variable-Subset Selection in Linear Regression for R".

That said, all-possible-subsets selection is generally not recommended statistical practice. Consider following the advice of Frank Harrell, who is probably the best source. His book "Regression Modeling Strategies" and his class notes are excellent.
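With only five predictors, brute force is still feasible (31 non-empty subsets). A minimal sketch, assuming ordinary linear regression via `LinearModelFit` rather than `Predict`; the names `xs`, `names`, `vars` and `results` are introduced here for illustration only:

```wolfram
(* Data from the original question *)
y  = {7.56, 3.79, 2.85, 8.47, 1.37, 5.16, 3.83, 6.58, 6.14, 5.82};
x1 = {1.7, 0.67, 0.5, 7.9, 5.5, 0.81, 6.9, 4.6, 8.2, 8.1};
x2 = {9.02, 0.85, 1.09, 3.37, 8.64, 6.72, 0.62, 7.12, 7.42, 2.03};
x3 = {2.69, 2.04, 3.12, 2.09, 0.89, 7.82, 7.56, 2.24, 7.25, 3.44};
x4 = {6.01, 2.73, 5.35, 7.33, 9.38, 9.94, 1.19, 5.05, 9.39, 8.86};
x5 = {0.84, 2.31, 4.42, 4.18, 8.46, 3.02, 9.09, 6.14, 4.10, 7.15};

xs    = {x1, x2, x3, x4, x5};
names = {"x1", "x2", "x3", "x4", "x5"};
vars  = {v1, v2, v3, v4, v5};   (* formal variables without values; hypothetical names *)

(* Fit a linear model (with intercept) for every non-empty subset of inputs *)
results = Table[
   With[{lm = LinearModelFit[
       Transpose[Append[xs[[idx]], y]],   (* rows of the form {inputs..., y} *)
       vars[[idx]], vars[[idx]]]},
    <|"Inputs" -> names[[idx]],
      "AICc" -> lm["AICc"],
      "RSquared" -> lm["RSquared"]|>],
   {idx, Subsets[Range[5], {1, 5}]}];

(* Smallest AICc first *)
SortBy[results, #["AICc"] &]
```

Sorting by `"RSquared"` instead would be expected to favor the larger models, which is the problem described in the earlier reply.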

POSTED BY: Jim Baldwin

Dear Jim,

Thank you so much for the useful explanations and for introducing the excellent references. I will certainly study them.

What I meant was using an iterative method to find the best combination based on high correlation and low root mean square error.

The program would choose the combinations itself and return the best one. This issue is very important in civil and environmental engineering and many other sciences.
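A minimal sketch of such a loop, reusing the 7/3 train/test split and the `Predict` settings from the original post; the helper `testRMSE` and the names `xs`, `names` and `scores` are hypothetical. It simply tries all 31 non-empty subsets, so it is exhaustive rather than clever:

```wolfram
(* Data from the original question *)
y  = {7.56, 3.79, 2.85, 8.47, 1.37, 5.16, 3.83, 6.58, 6.14, 5.82};
x1 = {1.7, 0.67, 0.5, 7.9, 5.5, 0.81, 6.9, 4.6, 8.2, 8.1};
x2 = {9.02, 0.85, 1.09, 3.37, 8.64, 6.72, 0.62, 7.12, 7.42, 2.03};
x3 = {2.69, 2.04, 3.12, 2.09, 0.89, 7.82, 7.56, 2.24, 7.25, 3.44};
x4 = {6.01, 2.73, 5.35, 7.33, 9.38, 9.94, 1.19, 5.05, 9.39, 8.86};
x5 = {0.84, 2.31, 4.42, 4.18, 8.46, 3.02, 9.09, 6.14, 4.10, 7.15};

xs    = {x1, x2, x3, x4, x5};
names = {"x1", "x2", "x3", "x4", "x5"};

(* Test-set RMSE for the inputs selected by idx, e.g. idx = {1, 4} for y = f(x1, x4) *)
testRMSE[idx_List] := Module[{tuples, train, test, p},
  tuples = Thread[Transpose[xs[[idx]]] -> y];
  train = Take[tuples, 7];    (* first 7 rows for training *)
  test  = Take[tuples, -3];   (* last 3 rows for testing *)
  p = Predict[train, Method -> "NeuralNetwork", PerformanceGoal -> "Quality"];
  RootMeanSquare[test[[All, 2]] - (p /@ test[[All, 1]])]]

(* Evaluate every non-empty subset of the five inputs *)
scores = Table[names[[idx]] -> testRMSE[idx], {idx, Subsets[Range[5], {1, 5}]}];

(* Smallest test RMSE first *)
SortBy[scores, Last]
```

With only three test points the ranking is likely to be noisy and may change between runs, so the cautions in the replies above about picking a single "best" subset still apply.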

Again I appreciate your time.

POSTED BY: M.A. Ghorbani