Hi All,
This blog post discusses the application of several algorithms for analysis of census income data:
http://mathematicaforprediction.wordpress.com/2014/03/30/classification-and-association-rules-for-census-income-data/
I used the same data in a previous discussion about mosaic plots because of the data's categorical variables.
Here is a table of the histograms for age, education-num, and hours-per-week:
The two classifiers used are (1) decision trees and (2) naive Bayesian classifiers. Both classifiers are trained with the same training data set and tested with the same test data set. For each classifier I measured the classification success rates after shuffling each of the columns in the test data. (Only one column is shuffled at a time.) The more the success rate drops when a column is shuffled, the more the classifier relies on that column.
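To make the shuffling procedure concrete, here is a minimal Python sketch. It is not the code from the blog post: the function names, the toy test rows, and the stand-in classifier are all made up for illustration, and any trained decision tree or naive Bayesian classifier could take the place of `toy_classifier`.

```python
import random

def success_rate(classify, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if classify(row) == label)
    return hits / len(rows)

def shuffled_column_rates(classify, rows, labels, seed=0):
    """Success rate after shuffling each test column, one column at a time."""
    rng = random.Random(seed)
    rates = {}
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        rng.shuffle(values)  # permute this column only
        shuffled = [row[:col] + (v,) + row[col + 1:] for row, v in zip(rows, values)]
        rates[col] = success_rate(classify, shuffled, labels)
    return rates

# Toy test set (made up): columns are (age, hours-per-week).
test_rows = [(25, 40), (52, 45), (38, 60), (19, 20), (61, 35), (45, 50)]
test_labels = [">50K" if age >= 38 else "<=50K" for age, _ in test_rows]

# Hypothetical stand-in classifier that only looks at the age column.
def toy_classifier(row):
    return ">50K" if row[0] >= 38 else "<=50K"

baseline = success_rate(toy_classifier, test_rows, test_labels)
rates = shuffled_column_rates(toy_classifier, test_rows, test_labels)
# How much worse the success rate becomes for each shuffled column.
drops = {col: baseline - rate for col, rate in rates.items()}
print(baseline, drops)
```

Shuffling a column the classifier ignores (hours-per-week here) leaves the success rate unchanged, while shuffling a column it depends on can only lower it in this toy setup.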
Here is a comparison of how much worse the success rates become after the shuffling:
I had to "categorize" the numerical columns (that is, replace each number with the interval it falls in) in order to be able to apply the association rule learning algorithm Apriori.
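As a rough illustration of what "categorizing" a numerical column can look like, here is a small Python sketch. The break points and the `categorize` helper are assumptions chosen for the example; the post does not say which intervals were actually used.

```python
from bisect import bisect_right

def categorize(value, breaks):
    """Map a finite number to a label for the half-open interval [lo, hi) containing it."""
    edges = [float("-inf")] + sorted(breaks) + [float("inf")]
    i = bisect_right(edges, value) - 1  # index of the interval's left edge
    return f"[{edges[i]}, {edges[i + 1]})"

# Hypothetical break points for the age column.
age_breaks = [25, 40, 60]
ages = [17, 25, 39, 64]
# Each number becomes a categorical item that Apriori can work with.
age_items = ["age=" + categorize(a, age_breaks) for a in ages]
print(age_items)
```

After this step every column is categorical, so each record can be treated as a set of items such as `age=[25, 40)`.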
Here is a table with some of the rules with the highest confidence:
The confidence of an association rule A->C with antecedent A and consequent C is defined to be the ratio P(A and C) / P(A), that is, the fraction of the records satisfying A that also satisfy C. The higher the ratio the more confidence we have in the rule. (If the ratio is 1 we have a logical rule: every record that satisfies A also satisfies C.)
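A short Python sketch of the confidence computation, using the standard definition P(A and C) / P(A) expressed with record counts. The records below are made up for illustration and are not taken from the census data; the function names are likewise hypothetical.

```python
def count_matching(records, items):
    """Number of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= record)

def confidence(records, antecedent, consequent):
    """Confidence of antecedent -> consequent: count(A and C) / count(A)."""
    n_antecedent = count_matching(records, antecedent)
    if n_antecedent == 0:
        raise ValueError("the antecedent matches no records")
    n_both = count_matching(records, set(antecedent) | set(consequent))
    return n_both / n_antecedent

# Made-up records; each record is a set of "column=value" items.
records = [
    {"marital-status=Never-married", "income=<=50K"},
    {"marital-status=Never-married", "income=<=50K"},
    {"marital-status=Never-married", "income=<=50K"},
    {"marital-status=Never-married", "income=>50K"},
    {"marital-status=Married", "income=>50K"},
]

rule_confidence = confidence(
    records, {"marital-status=Never-married"}, {"income=<=50K"}
)
print(rule_confidence)  # 3 of the 4 never-married records have income <=50K
```

Dividing the two counts is the same as dividing the two probabilities, since both share the total number of records as denominator.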
Here is a table showing the rules with the highest confidence for the consequent "<=50K":
The analysis confirmed (and quantified) what is considered common sense:
Age, education, occupation, and marital status (or relationship kind) are good for predicting income (above a certain threshold).
Using the association rules we see, for example, that (1) if a person earns more than $50000 he is very likely to be a married man with a large number of years of education; (2) single parents, younger than 25 years, who studied less than 10 years, and were never-married make less than $50000.