
Regression frustration

Posted 10 years ago
POSTED BY: Pat McCarthy
15 Replies

Looking at the table of values at the end of your notebook, instead of

DeleteCases[tstmdl, "na", Infinity]

you should use

DeleteCases[tstmdl, "\"na\"", Infinity]

This, though, would break the shape of your data, so you might be better off using

tstmdl /. "\"na\"" -> 0
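
To illustrate the difference between the two approaches, here is a minimal sketch on made-up data. The name toy and its rows are hypothetical stand-ins for tstmdl, assuming the missing entries are strings that literally contain the quote characters.

```mathematica
(* Hypothetical toy data standing in for tstmdl; the marker string
   itself contains double-quote characters *)
toy = {{39, "\"na\"", 40}, {50, "Private", 13}};

(* Deleting the marker at every level shortens only the affected rows,
   so the data is no longer rectangular *)
ragged = DeleteCases[toy, "\"na\"", Infinity];
Length /@ ragged

(* Replacing the marker keeps every row the same length *)
filled = toy /. "\"na\"" -> 0;
Dimensions[filled]
```

Whether 0 is a sensible stand-in for a missing value depends on the column, so the replacement value here is only a placeholder.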
POSTED BY: Anton Antonov

I have attached a notebook to this response that goes through the steps of building a regression model with LinearModelFit and using it for classification. One important question is how to separate the regression model values so we can obtain the best possible classification rates. In the notebook this is done using ROC. (See http://en.wikipedia.org/wiki/Receiver_operating_characteristic .)

LinearModelFit has several signatures. For the data we have, I think the most convenient one is LinearModelFit[{m, v}], where m is the design matrix and v is the response vector.

To keep the exposition simple, the regression in the notebook uses only the two numerical columns "education-num" and "hours-per-week". With the replacement rules {"<=50K"->0,">50K"->1} we convert the data column "income" into a vector of 0's and 1's.

In the attached notebook we call positive the income values ">50K" and negative the income values "<=50K".

The result of LinearModelFit is a function fitted to the training set. We can plot a histogram of the values from the regression model and then pick a threshold above which the model values are treated as 1's (and hence ">50K").
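
The steps described so far can be sketched roughly as follows. This is not the code from the attached notebook: the rows, the 0.5 threshold, and all variable names are made up for illustration, and the sketch assumes the design-matrix form of LinearModelFit does not add an intercept on its own, so a column of 1's is prepended explicitly.

```mathematica
(* Hypothetical toy rows: {education-num, hours-per-week, income} *)
rows = {{13, 40, "<=50K"}, {9, 35, "<=50K"}, {14, 50, ">50K"},
   {10, 40, "<=50K"}, {16, 60, ">50K"}, {13, 45, ">50K"}};

(* Response vector of 0's and 1's *)
v = rows[[All, 3]] /. {"<=50K" -> 0, ">50K" -> 1};

(* Design matrix: a column of 1's for the intercept (an assumption of this
   sketch), followed by the two numerical predictors *)
m = N[Prepend[#, 1] & /@ rows[[All, {1, 2}]]];

lm = LinearModelFit[{m, v}];

(* Regression model values on the training rows *)
modelValues = lm["PredictedResponse"];
Histogram[modelValues]

(* Hypothetical threshold: values above it are labeled ">50K" *)
threshold = 0.5;
predicted = If[# > threshold, ">50K", "<=50K"] & /@ modelValues
```

On the real data the threshold would be read off the histogram rather than fixed at 0.5.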

In the attached notebook, the first example of using the result of LinearModelFit is extended with a more systematic approach to determining the best threshold for separating the regression model values. It uses the ROC functions Positive Predictive Value (PPV), Negative Predictive Value (NPV), True Positive Rate (TPR), accuracy (ACC), and specificity (SPC).
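
For reference, these quantities can be computed from the counts of true/false positives and negatives at a given threshold. The sketch below uses the standard definitions on made-up model values. The helper rocStats and the sample data are hypothetical and not taken from the notebook.

```mathematica
(* Hypothetical model values and the corresponding true 0/1 labels *)
modelValues = {0.1, 0.4, 0.35, 0.8, 0.7, 0.2};
actual = {0, 0, 1, 1, 1, 0};

(* ROC statistics at one threshold. Note: a ratio is undefined when its
   denominator is 0, e.g. PPV when nothing is predicted positive. *)
rocStats[threshold_] :=
 Module[{pairs, tp, fp, tn, fn},
  (* pair each predicted label (1 if above the threshold) with the true one *)
  pairs = Transpose[{Boole[# > threshold] & /@ modelValues, actual}];
  tp = Count[pairs, {1, 1}];
  fp = Count[pairs, {1, 0}];
  tn = Count[pairs, {0, 0}];
  fn = Count[pairs, {0, 1}];
  <|"PPV" -> tp/(tp + fp), "NPV" -> tn/(tn + fn),
   "TPR" -> tp/(tp + fn), "SPC" -> tn/(tn + fp),
   "ACC" -> (tp + tn)/Length[actual]|>]

rocStats[0.5]
```

Evaluating rocStats over a range of thresholds and comparing the resulting values is one way to choose the cut-off.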


Attachments:
POSTED BY: Anton Antonov
Posted 10 years ago

Anton, thanks for the additional code and output; again, very helpful in answering my original question. I also tried your DeleteCases suggestion, but that didn't work. I'm attaching the output. The code is by no means elegant, but this is a work in progress. Pat

Attachments:
POSTED BY: Pat McCarthy

Thanks, Pat.

Here are several data reading and importing steps (using the "Adult" dataset) that I hope you will find concrete and useful.

In[4]:= lines = Import["~/Data sets/adult/adult.data"];
lines = Select[lines, Length[#] > 3 &];
Dimensions[lines]

Out[6]= {32561, 15}

In[7]:= linesTest = Import["~/Data sets/adult/adult.test"];
linesTest = Select[linesTest, Length[#] > 3 &];
Dimensions[linesTest]

Out[9]= {16281, 15}

In[10]:= columnNames = 
 StringSplit[
  "age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country", ","]

Out[10]= {"age", "workclass", "fnlwgt", "education", "education-num",  "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"}

In[11]:= AppendTo[columnNames, "income"]

Out[11]= {"age", "workclass", "fnlwgt", "education", "education-num",  "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"}

In[12]:= Magnify[
 TableForm[lines[[1 ;; 12]], 
  TableHeadings -> {Automatic, 
    Style[#, Blue, FontFamily -> "Times"] & /@ columnNames}], 0.9]

Is this post in a direction you would like this discussion to go?

POSTED BY: Anton Antonov