Message Boards Message Boards

Re-levelling in Generalized Linear Models

Hi,

i have a question relating to generalized linear models. The Wolfram language has very powerful tools for the analysis of data built right into it. A really powerful tool is GeneralizedLinearModelFit. Generalized linear models are generalisations of the standard linear regression. They allow for non-normal distributions and appear to rapidly replace ANOVA. They are extremely useful to detect relationships among variables that allow to predict a target quantity.

On this website they describe the following example:

A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, effect admission into graduate school. The response variable, admit/don't admit, is a binary variable.

This is not a very unrealistic setting. They use R to analyse the data. The same can be achieved very easily with Mathematica, too, like this:

data = Import["http://www.ats.ucla.edu/stat/data/binary.csv"];
GeneralizedLinearModelFit[Reverse /@ data[[2 ;;]], {x, y, z}, {x, y, z}, ExponentialFamily -> "Binomial", NominalVariables -> x]["ParameterTable"]

The Reverse function is just to make sure that Mathematica gets the variables in the order it needs. Mathematica's result looks like this:

enter image description here

This seems to be correct but differs from the output of R which is:

enter image description here

The difference seems to be due to the nominal variable x, i.e. the variable for which four different choices for x: 1 - 4. These variables are not independent. Each student has one of the four. So knowing that it is not any of three choices allows you to conclude that it must be the fourth. In these situations both Mathematica and R choose one of the variables as a reference. In this particular case Mathematica chooses x[4], i.e. the case when x is 4, and R chooses x[1] as base level. In R there is a way to "re-level" this, so that the reference can be chosen by the user. I have tried to find a similar option in Mathematica, but I could not find one. Does anybody know what that option is? Otherwise Mathematica and R obtain a similar solution, see for example all the Standard Errors and the values for GRE and GPE, resp, y and z. These estimates allow us to draw a conclusion about the relevance of the different variables: a larger value of the "Estimate" indicates a stronger influence. The signs indicate the direction of the influence. So in this case, coming from a rather low-level institution has a large negative impact on the probability to be admitted to the graduate school. This seems to be more important than the GPA or GRE.

Thanks for your help,

Marco

PS: If any developers read this, it would be cool to be able to do mixed effect models in Mathematica, too.

POSTED BY: Marco Thiel
2 Replies
Posted 10 years ago

Because you've excellently and correctly explained the differences in how Mathematica and R handle nominal variables, I assume the question is how to have flexibility in adding in the necessary constraints to make the design matrix non-singular. I, too, would like to see some flexibility in that.

For this particular example, one could get Mathematica to reproduce the R analysis by changing the sign of the "rank" variable:

data[[All, 4]] = -data[[All, 4]];

Likely Mathematica uses an alphabetic ordering for non-numeric nomimal variables so you'd need to do something different to get one of the levels coded so that it comes out first in the sorting order.

An alternative is to obtain coefficients for all four of the ranks is to remove the intercept with

IncludeConstantBasis -> False

That gives a coefficient for each level of the rank variable and is sometimes more easily interpreted especially when there isn't a natural level to choose as a base level.

Objectives can differ even for the same set of data so, yes, more easy-to-use flexibility would be very useful. (And having more mixed models would be even better. But programming robust algorithms for fitting such models - many times with inadequate data - is a great challenge. Clear and extensive documentation would be essential.)

POSTED BY: Jim Baldwin

Dear Jim,

thank you very much for your reply. It was very useful. Your suggestion to use

data[[All, 4]] = -data[[All, 4]];

had already come to mind, but I use

data[[2 ;;]] /. {a_, b_, c_, d_} /; d == 1 -> {a, b, c, "zz1"}

instead, as it gives more flexibility as to which variable to use as a base. The idea, however, is the same as yours: using the fact that Mathematica lexicographical ordering and then just "renaming" the variables. My complete GLM then looks like this:

GeneralizedLinearModelFit[Reverse /@ (data[[2 ;;]] /. {a_, b_, c_, d_} /; d == 1 -> {a, b, c, "zz1"}), {x, y, z}, {x, y, z}, ExponentialFamily -> "Binomial",NominalVariables -> x]["ParameterTable"]

In this case I obtain the exact same result as shown on the website for R:

enter image description here

I still believe that this is a rather "dirty" fix and it would be nice if there was a built-in option, perhaps an undocumented one. It kind of lacks the elegance of a typical Mathematica command. I am aware that it is quite easy to convert all these numbers into one another, but as you say, it would be nice to have an option for that.

Using IncludeConstantBasis -> False is similar to the renaming of the variables. It is useful, but not as elegant and for some applications it would be nice to use the ConstantBasis.

Regarding the mixed models you are right, too. As you say, they are not easy to implement. At the moment I use Mathematica's RLink to use the same functionality of R. I would rather have everything within the framework of Mathematica.

As a research statistician at the Pacific Southwest Research Station you must come across many situations where both GLMs and Mixed Effect Models can be quite useful. There is a whole range of really cool topics you are working on. You must be using an incredible wealth of modelling and statistics.

Thanks again for your reply.

Cheers,

Marco

PS: I really like your bat occupancy model in the Demonstrations.

POSTED BY: Marco Thiel
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract