WOLFRAM COMMUNITY

Regression frustration

Pat McCarthy

Posted 10 years ago

POSTED BY: Pat McCarthy

Pat McCarthy

Posted 10 years ago

POSTED BY: Pat McCarthy

Pat McCarthy

Posted 10 years ago

POSTED BY: Pat McCarthy

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 10 years ago

It seems to me that you have certain assumptions about Mathematica's behavior based on your experience with R, S, or SAS. For example, Mathematica does not give Missing[___] the special treatment that R and S give NA. As I mentioned in my first post in this discussion, R, S, and SAS are domain-specific languages; their style and structure make more sense after taking a statistics class (or two). Mathematica has both (i) a powerful, general functional programming language, and (ii) functionalities for different scientific and mathematical sub-cultures. Because Mathematica is a general system for mathematical and technological computations, some of its out-of-the-box behavior will not fit expectations based on R, S, or SAS experience.
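As an illustration of the point about Missing, here is a minimal sketch with made-up values (the names wages and table are hypothetical, not taken from the notebook). It assumes a version that has DeleteMissing, and it only shows that missing entries have to be removed explicitly before computing a summary statistic.

```wolfram
(* Hypothetical sample: a wage column with one missing entry *)
wages = {10.5, 12.0, Missing["na"], 9.5};

(* Missing["na"] is not skipped automatically, so this is not a plain number *)
Mean[wages]

(* Drop the missing entries first, then average the observed values *)
Mean[DeleteMissing[wages]]

(* For tabular data, keep only the rows that contain no Missing[...] at all *)
table = {{1, 10.5}, {2, Missing["na"]}, {3, 9.5}};
Select[table, FreeQ[#, _Missing] &]
```

Dropping whole rows is one possible policy; whether it is appropriate depends on the analysis.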

POSTED BY: Anton Antonov

Pat McCarthy

Posted 10 years ago

I did not think the code worked because I issued a Dimensions command, which reported the same dimensions. However, I now realize what you mean by messing up the table. So yes, both codes work. I also tried the second version but replaced '0' with Missing[na]. That also worked, and I was hoping that if I then issued a Mean command, the program would give the mean based on the non-missing values. This didn't happen, so presumably the program is not seeing Missing[na] as a missing value. So I need to look into this more. Thanks again.

POSTED BY: Pat McCarthy

Pat McCarthy

Posted 10 years ago

Ok. Then I must have made a mistake. I'll redo and let you know. And apologies for all the trouble. Pat

POSTED BY: Pat McCarthy

Pat McCarthy

Posted 10 years ago

Thanks. Unfortunately, neither of these works. I'm currently exploring SemanticImport, which imports the data correctly and recognizes the headers, but I am again having trouble with the missing values. If I give the command testdata = SemanticImport["path\test.xlsx"], the data import correctly, including variable names. But when I issue the command (or some slight variations) testdata = SemanticImport["path\test.xlsx", "MissingDataRules" -> <|"wage" -> {"na" -> Missing["xx"]}, "lwage" -> {"na" -> Missing["rr"]}|>], trying to replace na with a Mathematica missing value (Missing["xx"] and Missing["rr"]), the program returns $Failed. Still stuck in a missing value limbo. Pat

POSTED BY: Pat McCarthy

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 10 years ago

Both commands I mentioned in my previous post worked on the data in the notebook you provided.

POSTED BY: Anton Antonov

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 10 years ago

Looking at the table of values at the end of your notebook, instead of DeleteCases[tstmdl, "na", Infinity] you should use DeleteCases[tstmdl, "\"na\"", Infinity]. This, though, would break the shape of your data, so you might be better off using tstmdl /. "\"na\"" -> 0.
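The difference between the two commands can be sketched on a small made-up table (the values below are hypothetical; only the name tstmdl and the "\"na\"" string come from the discussion):

```wolfram
(* Hypothetical 3x2 table; the imported string carries literal quote characters *)
tstmdl = {{1, 2.5}, {2, "\"na\""}, {3, 4.0}};

Dimensions[tstmdl]   (* {3, 2} *)

(* DeleteCases removes only the matching cell, so the rows get unequal lengths *)
ragged = DeleteCases[tstmdl, "\"na\"", Infinity]
(* {{1, 2.5}, {2}, {3, 4.}} *)

Dimensions[ragged]   (* {3}: no longer a rectangular table *)

(* ReplaceAll swaps the cell in place and keeps the rectangular shape *)
filled = tstmdl /. "\"na\"" -> 0;
Dimensions[filled]   (* {3, 2} *)
```

Note that replacing with 0 changes any statistic computed on that column, so the replacement value is a modeling decision rather than a neutral placeholder.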

POSTED BY: Anton Antonov

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 10 years ago

I have attached a notebook to this response that goes through the steps of building a regression model with LinearModelFit and using it for classification. One important question is how to separate the regression model values so we can obtain the best possible classification rates. In the notebook this is done using ROC. (See http://en.wikipedia.org/wiki/Receiver_operating_characteristic .)

LinearModelFit has several signatures. For the data we have, I think the most convenient one is LinearModelFit[{m,v}]. In order to keep the exposition simple, in the notebook the regression is done with the two numerical columns "education-num" and "hours-per-week". With the replacement rules {"<=50K"->0,">50K"->1} we convert the data column "income" into a vector of 0's and 1's. In the attached notebook we call the income values ">50K" positive and the income values "<=50K" negative.

The result of LinearModelFit is a function based on the training set of data. We can plot a histogram of values from the regression model, and then pick a threshold above which the model values are considered to be 1's (and hence ">50K").

In the attached notebook the first example of using the result of LinearModelFit is extended with a more systematic approach to determining the best threshold to separate the regression model values. The ROC functions used are Positive Predictive Value (PPV), Negative Predictive Value (NPV), True Positive Rate (TPR), accuracy (ACC), and specificity (SPC).
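The steps described above can be sketched roughly as follows. This is not the code from the attached notebook: the six data rows are invented for illustration, the constant column of 1's is added on the assumption that an intercept is wanted, and the threshold of 0.5 is a placeholder for the value the notebook derives with ROC.

```wolfram
(* Hypothetical rows: {education-num, hours-per-week, income} *)
data = {{9, 40, "<=50K"}, {13, 50, ">50K"}, {10, 35, "<=50K"},
        {14, 60, ">50K"}, {11, 45, "<=50K"}, {16, 55, ">50K"}};

(* Response vector of 0's and 1's, using the replacement rules from the post *)
v = data[[All, 3]] /. {"<=50K" -> 0, ">50K" -> 1};

(* Design matrix: a constant column followed by the two numerical columns *)
m = N[Prepend[#, 1] & /@ data[[All, {1, 2}]]];

(* Design-matrix signature: no variable names need to be listed *)
lm = LinearModelFit[{m, v}];
lm["BestFitParameters"]

(* Model values on the training rows; Histogram[fitted] helps when choosing a threshold *)
fitted = lm["PredictedResponse"];

(* Assumed threshold; the notebook selects it systematically with ROC *)
threshold = 0.5;
predicted = Boole[# > threshold] & /@ fitted;

(* Fraction of training rows classified correctly *)
accuracy = N[Count[predicted - v, 0]/Length[v]]
```

With real data the accuracy should be measured on a held-out test set rather than on the rows used for fitting.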

POSTED BY: Anton Antonov

Pat McCarthy

Posted 10 years ago

Anton, thanks for the additional code and output, again very helpful in answering my original question. Also, I tried your DeleteCases suggestion, but that didn't work. I'm attaching the output. The code is by no means elegant, but this is a work in progress. Pat

Attachments:

POSTED BY: Pat McCarthy

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 10 years ago

Attachments:

POSTED BY: Anton Antonov

Pat McCarthy

Posted 10 years ago

Anton, this is quite helpful, and I have adapted your code to successfully load and reconfigure the data, at least to the extent of getting the dependent variable in the right spot. This leads to one immediate question. With large datasets, such as your adult dataset, do all of the explanatory variables need to be identified in the LinearModelFit command? Every example I see does this, and when I tried to estimate a simple model, I received an error that the number of coordinates was more than the number of variables. Thanks. Pat

POSTED BY: Pat McCarthy

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 10 years ago

POSTED BY: Anton Antonov

Pat McCarthy

Posted 10 years ago

Anton, thanks for your reply and the references. I am familiar with the examples similar to the weather data and the census example is much closer to what I have in mind. I will be working through this. I understand that the software programs I mentioned are domain specific and that Mathematica is closer to Gauss or Matlab. Yet even for the domain specific software, one must understand the specifics of how data are read in etc. Once inside the 'box' with one's data, it then becomes much easier to exploit the functionality of the program. Pat

POSTED BY: Pat McCarthy

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 10 years ago

I do have experience with importing large Excel, SQL, or text data sets in Mathematica and using classification or regression methods over that data. I have described one such activity, similar to the scenario you outlined, in this blog post: http://mathematicaforprediction.wordpress.com/2014/03/30/classification-and-association-rules-for-census-income-data/ . And here is a blog post showing analysis using Mathematica's weather data access functions: http://mathematicaforprediction.wordpress.com/2014/01/13/estimation-of-conditional-density-distributions/ . I think your frustration and complaint are valid, but you also have them because you have used SAS or Stata, which are domain-specific languages, not general systems for mathematical computations and visualization like Mathematica.

POSTED BY: Anton Antonov
