Message Boards Message Boards

Regression frustration

Posted 11 years ago

I would like help on what falls into the category of mathematica for dummies, running a multivariate regression on excel-imported data. I have spent many more hours than I would like to admit searching Mathematica’s website resources, watching demonstrations and tutorials, googling and binging, downloading and reading 20 or so pdfs, checking Wolfram community q&as, all in an attempt to find one realistic example. In my teaching and research activities, I routinely use SAS (and limdep and, occasionally, STATA – statistical software programs that enable users to easily input data, deal with missing values, transform data, estimate models with much much less suffering than I have found with Mathematica. I very much like Mathematica’s integrated approach and have used it, along with MathStatica, in teaching graduate courses for the past couple of years. Yet, during this period, I have been very frustrated in not being able to easily integrate ‘real world’ data sets into the courses and into some research related activities. And it is remarkable that, with all of my searching, I have yet to find one straightforward example that accomplishes all of the following (there are gobs of separate examples and toy problems illustrating aspects of each): 1) imports an excel file with headers, missing values (e.g. SAS uses a period (.)), large number of variables where the dependent variable is not the last variable; 2) transforms the data into a form that Mathematica requires; 3) calculates descriptive statistics for all variables, accounting for missing values across several varibles 3) uses LinearModelFit to estimate a multivariate regression model. If anyone can point me to such an example or a resource that includes such detailed examples (Varian's book including Belsley's contribution are not all that helpful for this purpose), I would be most appreciative. If this is not straightforward in Mathematica, then that also would be very useful to know. By the way, an excellent resource for SAS is a publication ‘The Little SAS Book’. Given Mathematica's power and complexity, such a resource for its statistical database users would be seriously welcome. I feel like I’m sitting in a Ferrari and have no idea how to access the gas tank without which knowledge all of the car's power and accessories go untapped! Pat

POSTED BY: Pat McCarthy
15 Replies
Posted 10 years ago

Here's code that works well for my purposes and handles the missing values by eliminating observations with missing values.

sample := Flatten[ Import[paht\sample.xlsx", "xlsx", HeaderLines -> 1], 1] ; Dimensions[sample] TableForm[sample[[1 ;; 10]]]

missingdata[entry] := Not[MatchQ[entry, {_?NumberQ}]] subsetdta = DeleteCases[sample, _?missingdata] ;

regdata = subsetdta[[All, {7, 20, 3, 4, 2}]]; data = regdata ; lm = LinearModelFit[data, {x1, x2, x3, x4}, {x1, x2, x3, x4}]

Although part this was domain specific, a greater part is familiarity with the appropriate commands and functionality of mathematica. This is why I believe that a few mathematica examples similar to that above would be very helpful to users such as myself that identify the commands need in going from data importing through a multiple regression problem. Thanks for pointing me in the right direction.

POSTED BY: Pat McCarthy
Posted 10 years ago

You're correct, which is why I'm spending as much time as I am trying to understand mathematica's language and functionalities.

POSTED BY: Pat McCarthy

It seems to me that you have certain assumptions about Mathematica's behavior based on your experience with R, S, or SAS.

For example, Mathematica does not give the special treatment of Missing[___] in the same way R and S do of NA.

As I mentioned in my first post in this discussion, R, S, and SAS are domain specific languages, their style and structure make more sense after taking a statistics class (or two). Mathematica has both (i) a powerful, general functional programming language, and (ii) functionalities for different scientific and mathematical sub-cultures. Because Mathematica is a general system for mathematical and technological computations, certain out of the box behavior would not fit expectations based upon R, S, or SAS experiences.

POSTED BY: Anton Antonov
Posted 10 years ago

I did not think the code worked because I issued a dimensions command which identified the same dimension. However, I now realize what you mean by messing up the table. So yes, both codes work. I also tried the second version but replacing '0' with Missing[na]. That also worked and I was hoping that if I then issued a Mean command, the program would give the mean based upon non-missing variables. This didn't happen so presumably the program is not seeing Missing[na] as a missing value. So I need to look into this more. Thanks again.

POSTED BY: Pat McCarthy
Posted 10 years ago

Ok. Then I must have made a mistake. I'll redo and let you know. And apologies for all the trouble. Pat

POSTED BY: Pat McCarthy
Posted 10 years ago

Thanks. Unfortunately, neither of these works. I'm currently exploring SemanticImport which imports the data correctly and recognizes the headers. But again having trouble with the missing values. If I give the command

testdata = SemanticImport["path\test.xlsx"], the data import correctly including variable names. But when I issue the command (or some slight variations),

testdata = SemanticImport["path\test.xlsx" , "MissingDataRules" -> <|"wage" -> {"na" -> Missing["xx"]}, "lwage" -> {"na" -> Missing["rr"]}|>]

trying to replace na with a mathematica missing value (Missing [xx] and Missing [rr]), the program returns a $Failed. Still stuck in a missing value limbo. Pat

POSTED BY: Pat McCarthy

Both commands I mentioned in my previous post worked on the data in the notebook you provided.

POSTED BY: Anton Antonov

Looking at the table of values at the end of your notebook, instead of

DeleteCases[tstmdl, "na", Infinity]

you should use

DeleteCases[tstmdl, "\"na\"", Infinity]

This though would break the shape of your data, so you might be better of using

tstmdl /. "\"na\"" -> 0
POSTED BY: Anton Antonov

I have attached a notebook to this response that goes through the steps of building a regression model with LinearModelFit and using it for classification. One important question is how to separate the regression model values so we can obtain the best possible classification rates. In the notebook this is done using ROC. (See http://en.wikipedia.org/wiki/Receiver_operating_characteristic .)

LinearModelFit has several signatures. For the data we have I think the most convenient one is LinearModelFit[{m,v}] .

In order to keep the exposition simple in the notebook the regression is done with the two numerical columns "education-num" and "hours-per-week". With the replacement rules {"<=50K"->0,">50K"->1} we convert the data column "income" into a vector of 0's and 1's.

In the attached notebook we call positive the income values ">50K" and negative the income values "<=50K".

The result of LinearModelFit is a function based on the training set of data. We can plot a histogram of values from the regression model, and then we pick a threshold above which the model values are considered to be 1's (and hence ">=50K").

In the attached notebook the first example of using the result of LinearModelFit is extended with a more systematic approach of determining the best threshold to separate the regression model values. The ROC functions Positive Predictive Value (PPV), Negative Predictive Value (NPV), True Positive Value (TPV), accuracy (ACC), and specificity (SPC).

enter image description here enter image description here

Attachments:
POSTED BY: Anton Antonov
Posted 11 years ago

Anton, thanks for the additional code and output, again very helpful in answering my original question. Also, I tried your suggestion on the delete cases but that didn't work. I'm attaching the output. The code is by no means elegant but this is a work in progress. Pat

Attachments:
POSTED BY: Pat McCarthy

I have attached an update of the notebook that describes the use of LinearModelFit with ROC. In the updated notebook one more column, "age", of the "Adult" dataset is used to do the regression, and some minor additional comments and links are included.

Attachments:
POSTED BY: Anton Antonov
Posted 11 years ago

Anton, this is quite helpful and I have adapted your code to successfully load and reconfigure the data, at least to the extent of getting the dep variable in the right spot. This leads to one immeidate question. With large datasets, as your adult dataset, do all of the explanatory variables need to be identified in the LInearModelfit command? Every example I see does this and when I tried to estimate a simple model, I received an error that the number of coordinates was more than the number of variables. Thanks. Pat

POSTED BY: Pat McCarthy

Thanks, Pat.

Here are several data reading and importing steps (using the "Adult" dataset) that I hope you will find concrete and useful.

In[4]:= lines = Import["~/Data sets/adult/adult.data"];
lines = Select[lines, Length[#] > 3 &];
Dimensions[lines]

Out[6]= {32561, 15}

In[7]:= linesTest = Import["~/Data sets/adult/adult.test"];
linesTest = Select[linesTest, Length[#] > 3 &];
Dimensions[linesTest]

Out[9]= {16281, 15}

In[10]:= columnNames = 
 StringSplit[
  "age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country", ","]

Out[10]= {"age", "workclass", "fnlwgt", "education", "education-num",  "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"}

In[11]:= AppendTo[columnNames, "income"]

Out[11]= {"age", "workclass", "fnlwgt", "education", "education-num",  "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"}

In[12]:= Magnify[
 TableForm[lines[[1 ;; 12]], 
  TableHeadings -> {Automatic, 
    Style[#, Blue, FontFamily -> "Times"] & /@ columnNames}], 0.9]

Is this post in a direction you would like this discussion to go?

POSTED BY: Anton Antonov
Posted 11 years ago

Anton, thanks for your reply and the references. I am familiar with the examples similar to the weather data and the census example is much closer to what I have in mind. I will be working through this. I understand that the software programs I mentioned are domain specific and that Mathematica is closer to Gauss or Matlab. Yet even for the domain specific software, one must understand the specifics of how data are read in etc. Once inside the 'box' with one's data, it then becomes much easier to exploit the functionality of the program. Pat

POSTED BY: Pat McCarthy

I do have experience with importing large Excel, SQL, or text data sets in Mathematica and using classification or regression methods over that data.

I have described one such activity, similar to the scenario you outlined, in this blog post: http://mathematicaforprediction.wordpress.com/2014/03/30/classification-and-association-rules-for-census-income-data/ .

And here is a blog post showing analysis using Mathematica's weather data access functions: http://mathematicaforprediction.wordpress.com/2014/01/13/estimation-of-conditional-density-distributions/ .

I think your frustration and complaint are valid, but you also have them because you have used SAS or STATA, which are domain specific languages, not general systems for mathematical computations and visualization like Mathematica.

POSTED BY: Anton Antonov
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract