Group Abstract

Message Boards

WOLFRAM COMMUNITY

5.9K Views

2 Replies

12 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Data Science Mathematica Wolfram Language Optimization Statistics and Probability

Re-levelling in Generalized Linear Models

Marco Thiel

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 11 years ago

Hi, i have a question relating to generalized linear models. The Wolfram language has very powerful tools for the analysis of data built right into it. A really powerful tool is GeneralizedLinearModelFit. Generalized linear models are generalisations of the standard linear regression. They allow for non-normal distributions and appear to rapidly replace ANOVA. They are extremely useful to detect relationships among variables that allow to predict a target quantity. On this website they describe the following example: A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, effect admission into graduate school. The response variable, admit/don't admit, is a binary variable. This is not a very unrealistic setting. They use R to analyse the data. The same can be achieved very easily with Mathematica, too, like this: data = Import["http://www.ats.ucla.edu/stat/data/binary.csv"]; GeneralizedLinearModelFit[Reverse /@ data[[2 ;;]], {x, y, z}, {x, y, z}, ExponentialFamily -> "Binomial", NominalVariables -> x]["ParameterTable"] The Reverse function is just to make sure that Mathematica gets the variables in the order it needs. Mathematica's result looks like this: This seems to be correct but differs from the output of R which is: The difference seems to be due to the nominal variable x, i.e. the variable for which four different choices for x: 1 - 4. These variables are not independent. Each student has one of the four. So knowing that it is not any of three choices allows you to conclude that it must be the fourth. In these situations both Mathematica and R choose one of the variables as a reference. In this particular case Mathematica chooses x[4], i.e. the case when x is 4, and R chooses x[1] as base level. In R there is a way to "re-level" this, so that the reference can be chosen by the user. I have tried to find a similar option in Mathematica, but I could not find one. Does anybody know what that option is? Otherwise Mathematica and R obtain a similar solution, see for example all the Standard Errors and the values for GRE and GPE, resp, y and z. These estimates allow us to draw a conclusion about the relevance of the different variables: a larger value of the "Estimate" indicates a stronger influence. The signs indicate the direction of the influence. So in this case, coming from a rather low-level institution has a large negative impact on the probability to be admitted to the graduate school. This seems to be more important than the GPA or GRE. Thanks for your help, Marco PS: If any developers read this, it would be cool to be able to do mixed effect models in Mathematica, too.

POSTED BY: Marco Thiel

2 Replies

Sort By:

Jim Baldwin

Jim Baldwin, Retired

Posted 11 years ago

Because you've excellently and correctly explained the differences in how Mathematica and R handle nominal variables, I assume the question is how to have flexibility in adding in the necessary constraints to make the design matrix non-singular. I, too, would like to see some flexibility in that. For this particular example, one could get Mathematica to reproduce the R analysis by changing the sign of the "rank" variable: data[[All, 4]] = -data[[All, 4]]; Likely Mathematica uses an alphabetic ordering for non-numeric nomimal variables so you'd need to do something different to get one of the levels coded so that it comes out first in the sorting order. An alternative is to obtain coefficients for all four of the ranks is to remove the intercept with IncludeConstantBasis -> False That gives a coefficient for each level of the rank variable and is sometimes more easily interpreted especially when there isn't a natural level to choose as a base level. Objectives can differ even for the same set of data so, yes, more easy-to-use flexibility would be very useful. (And having more mixed models would be even better. But programming robust algorithms for fitting such models - many times with inadequate data - is a great challenge. Clear and extensive documentation would be essential.)

POSTED BY: Jim Baldwin

Marco Thiel

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 11 years ago

Dear Jim, thank you very much for your reply. It was very useful. Your suggestion to use data[[All, 4]] = -data[[All, 4]]; had already come to mind, but I use data[[2 ;;]] /. {a_, b_, c_, d_} /; d == 1 -> {a, b, c, "zz1"} instead, as it gives more flexibility as to which variable to use as a base. The idea, however, is the same as yours: using the fact that Mathematica lexicographical ordering and then just "renaming" the variables. My complete GLM then looks like this: GeneralizedLinearModelFit[Reverse /@ (data[[2 ;;]] /. {a_, b_, c_, d_} /; d == 1 -> {a, b, c, "zz1"}), {x, y, z}, {x, y, z}, ExponentialFamily -> "Binomial",NominalVariables -> x]["ParameterTable"] In this case I obtain the exact same result as shown on the website for R: I still believe that this is a rather "dirty" fix and it would be nice if there was a built-in option, perhaps an undocumented one. It kind of lacks the elegance of a typical Mathematica command. I am aware that it is quite easy to convert all these numbers into one another, but as you say, it would be nice to have an option for that. Using IncludeConstantBasis -> False is similar to the renaming of the variables. It is useful, but not as elegant and for some applications it would be nice to use the ConstantBasis. Regarding the mixed models you are right, too. As you say, they are not easy to implement. At the moment I use Mathematica's RLink to use the same functionality of R. I would rather have everything within the framework of Mathematica. As a research statistician at the Pacific Southwest Research Station you must come across many situations where both GLMs and Mixed Effect Models can be quite useful. There is a whole range of really cool topics you are working on. You must be using an incredible wealth of modelling and statistics. Thanks again for your reply. Cheers, Marco PS: I really like your bat occupancy model in the Demonstrations.

POSTED BY: Marco Thiel

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback