Group Abstract

Message Boards

WOLFRAM COMMUNITY

16.9K Views

4 Replies

12 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Staff Picks Business Analytics Data Science Mathematica Graphics and Visualization Wolfram Language Packages Statistics and Probability Machine Learning

Contingency tables creation examples

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 9 years ago

Introduction In statistics contingency tables are matrices used to show the co-occurrence of variable values of multi-dimensional data. They are fundamental in many types of research. This document shows how to use several functions implemented in Mathematica for the construction of contingency tables. Code In this document we are going to use the implementations in the package MathematicaForPredictionUtilities.m from MathematicaForPrediction at GitHub, [1]. Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"] The implementation of CrossTabulate in [1] is based on Tally, GatherBy, and SparseArray. The implementation of xtabsViaRLink in [1] is based on R's function xtabs called via RLink. Other package used in this document are [2] and [4] imported with these commands: Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"] Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/RSparseMatrix.m"] For a different approach to implementing cross-tabulation than those taken in [1] see the Stack Overflow answer http://stackoverflow.com/a/8101951 by Mr.Wizard. See my answer in the Community question "Two-way contingency table" using the data from that SO post. Using Titanic data Getting data titanicData = Flatten@List @@@ ExampleData[{"MachineLearning", "Titanic"}, "Data"]; titanicData = DeleteCases[titanicData, {___, _Missing, ___}]; titanicColumnNames = Flatten@List @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"]; aTitanicColumnNames = AssociationThread[titanicColumnNames -> Range[Length[titanicColumnNames]]]; Note that we have removed the records with missing data (for simpler exposition). Data summary Dimensions[titanicData] (* {1046, 4} ) RecordsSummary[titanicData, titanicColumnNames] Using CrossTabulate Assume that we want to group the people according to their passenger class and survival and we want to find the average age for each group. We can do that by 1. finding the counts contingency table $C$ for the variables "passenger class" and "passenger survival", 2. finding the age contingency table $A$ for the same variables, and 3. do the element-wise division $\frac{A}{C}$. ctCounts = CrossTabulate[ titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]; MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts ctTotalAge = CrossTabulate[ titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival", "passenger age"}]]]; MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctTotalAge MatrixForm[ ctTotalAge[[1]]/ Normal[ctCounts[[1]]], TableHeadings -> Values[Rest[ctTotalAge]]] (We have to make the sparse array `ctCounts` a regular array because otherwise we get warnings for division by zero because of the sparse array's default value.) Let us repeat the steps above by dividing the passengers before-hand according to their sex. Association@ Map[ (mCount = CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]; mAge = CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival", "passenger age"}]]]; #[[1, aTitanicColumnNames["passenger sex"]]] -> MatrixForm[mAge[[1]]/Normal[mCount[[1]]], TableHeadings -> Values[Rest[mAge]]]) &, GatherBy[titanicData, #[[aTitanicColumnNames["passenger sex"]]] &]] Using R's xtabs (via RLink) The alternative of CrossTabulate is xtabsViaRLink that is uses R's function xtabs via RLink. Needs["RLink`"] RLinkResourcesInstall[] InstallR[] ( {Paclet[RLinkRuntime,9.0.0.0, <>]} ) ctCounts = FromRXTabsForm@ xtabsViaRLink[ titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]], {"passenger.sex", "passenger.survival"}, " ~ passenger.sex + passenger.survival"]; MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts Relation to mosaic plots The graphical visualization of a dataset with mosaic plots, [2,3], is similar in spirit to contingency tables. Compare the following mosaic plot with the contingency table in the last section. MosaicPlot[ titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]] ] Using larger data Let us consider an example with larger data that has larger number of unique values in its columns. Getting online retail invoices data The following dataset is taken from [6]. data = Import[ "/Volumes/WhiteSlimSeagate/Datasets/UCI Online Retail Data Set/Online Retail.csv"]; columnNames = First[data]; data = Rest[data]; aColumnNames = AssociationThread[columnNames -> Range[Length[columnNames]]]; Data summary We have $\approx 66000$ rows and $8$ columns: Dimensions[data] ( {65499, 8} ) Here is a summary of the columns: Magnify[#, 0.75] &@RecordsSummary[data, columnNames] Contingency tables Country vs. StockCode There is no one-to-one correspondence between the values of the column "Description" and the column "StockCode" which can be seen with this command: MinMax@Map[Length@Union, GatherBy[data[[All, aColumnNames /@ {"Description", "StockCode"}]], First]] (* {1,144} ) The way in which the column "StockCode" was ingested made it have multiple types for its values: Tally[NumberQ /@ data[[All, aColumnNames["StockCode"]]]] ( {{False,9009},{True,56490}} ) So let us convert it to all strings: data[[All, aColumnNames["StockCode"]]] = ToString /@ data[[All, aColumnNames["StockCode"]]]; Here we find the contingency table for "Country" and "StockCode" over "Quantity" using CrossTabulate: AbsoluteTiming[ ctRes = CrossTabulate[ data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]]]; ] ( {0.256339,Null} ) Here we find the contingency table for "Country" and "StockCode" over "Quantity" using xtabsViaRLink: AbsoluteTiming[ rres = xtabsViaRLink[ data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]], {"Country", "StockCode", "Quantity"}, "Quantity ~ Country + StockCode"]; ctRRes = FromRXTabsForm[rres]; ] ( {0.843621,Null} ) Both functions produce the same result: ctRRes["matrix"] == N@ctRRes[[1]] ( True ) Note that `xtabsViaRLink` is slower but still fairly quick. Here we plot the contingency table using MatrixPlot : MatrixPlot[ctRRes["matrix"], AspectRatio -> 1/1.5, FrameTicks -> {{#, #} &@ Table[{i, ctRRes["rownames"][[i]]}, {i, Length[ctRRes["rownames"]]}], {Automatic, Automatic}}, ImageSize -> 550] Country vs. Quarter Let us extend the data with columns that have months and quarters corresponding to the invoice dates. The following commands computing date objects and extracting month and quarter values from them are too slow. (AbsoluteTiming[dobjs=DateObject[{#,{"Month","/","Day","/","Year"," \ ","Hour",":","Minute"}}]&/@data[[All,aColumnNames["InvoiceDate"]]]; ]) ( {30.2595,Null} ) (AbsoluteTiming[ dvals=DateValue[dobjs,{"MonthName","QuarterNameShort"}]; ]) ( {91.1732,Null} ) We can use the following ad hoc computation instead. dvals = StringSplit[#, {"/", " ", ":"}] & /@ data[[All, aColumnNames["InvoiceDate"]]]; This summary shows that the second value in the dates is day of month, and the first value is most likely month. Magnify[#, 0.75] &@ RecordsSummary[dvals[[All, 1 ;; 3]], "MaxTallies" -> 16] These commands extend the data and the corresponding column-name-to-index association. ms = DateValue[Table[DateObject[{2016, i, 1}], {i, 12}], "MonthName"]; dvals = Map[{ms[[#]], "Q" <> ToString[Quotient[#, 4] + 1]} &, ToExpression @ dvals[[All, 1]]]; dataM = MapThread[Join[#1, #2] &, {data, dvals}]; aColumnNamesM = Join[aColumnNames, <\|"MonthName" -> (Length[aColumnNames] + 1), "QuarterNameShort" -> (Length[aColumnNames] + 2)\|>]; ( {0.054877,Null} ) Here is the contingency table for "Country" vs "QuarterNameShort" over "Quantity". ctRes = CrossTabulate[ dataM[[All, aColumnNamesM /@ {"Country", "QuarterNameShort", "Quantity"}]]]; Magnify[#, 0.75] &@ MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctRes Uniform tables Often when making contingency tables over subsets of the data we obtain contingency tables with different rows and columns. For various reasons (programming, esthetics, comprehension) it is better to have the tables with the same rows and columns. Here is an example of non-uniform contingency tables derived from the online retail data of the previous section. We split the data over the countries and find contingency tables of "MonthName" vs "QuarterNameShort" over "Quantity". tbs = Association @ Map[ (xtab = CrossTabulate[#[[All, aColumnNamesM /@ {"MonthName", "QuarterNameShort", "Quantity"}]]]; #[[1, aColumnNamesM["Country"]]] -> xtab) &, GatherBy[ dataM, #[[aColumnNamesM[ "Country"]]] &]]; Magnify[#, 0.75] &@ Map[MatrixForm[#["XTABMatrix"], TableHeadings -> (# /@ {"RowNames", "ColumnNames"})] &, tbs]([[{1,12,-1}]]*) Using the object RSparseMatrix, see [4,5], we can impose row and column names on each table. First we convert the contingency tables into RSparseMatrix objects: tbs2 = Map[ ToRSparseMatrix[#["XTABMatrix"], "RowNames" -> #["RowNames"], "ColumnNames" -> #["ColumnNames"]] &, tbs]; And then we impose the desired row and column names: tbs2 = Map[ ImposeColumnNames[ ImposeRowNames[#, {"January", "December"}], {"Q1", "Q2", "Q3", "Q4"}] &, tbs2]; Magnify[#, 0.75] &@(MatrixForm /@ tbs2) References [1] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m. [2] Anton Antonov, Mosaic plot for data visualization implementation in Mathematica, (2014), MathematicaForPrediction at GitHub, package MosaicPlot.m. [3] Anton Antonov, "Mosaic plots for data visualization", (2014), MathematicaForPrediction at WordPress blog. URL: https://mathematicaforprediction.wordpress.com/2014/03/17/mosaic-plots-for-data-visualization/ . [4] Anton Antonov, RSparseMatrix Mathematica package, (2015) MathematicaForPrediction at GitHub. URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Misc/RSparseMatrix.m . [5] Anton Antonov, "RSparseMatrix for sparse matrices with named rows and columns", (2015), MathematicaForPrediction at WordPress blog. URL: https://mathematicaforprediction.wordpress.com/2015/10/08/rsparsematrix-for-sparse-matrices-with-named-rows-and-columns/ . [6] Daqing Chen, Online Retail Data Set, (2015), UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Online+Retail .

POSTED BY: Anton Antonov

4 Replies

Sort By:

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 5 years ago

There is a function in Wolfram Function Repository for doing cross tabulation: `CrossTabulate`.

POSTED BY: Anton Antonov

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 9 years ago

Generalization : CrossTensorate A generalization of `CrossTabulate` is the function `CrossTensorate` implemented in [1] that takes a "formula" argument similar to R's `xtabs`. This finds number of people of different sub-groups of Titanic data: ctRes = CrossTensorate[Count == "passenger survival" + "passenger sex" + "passenger class", titanicData, aTitanicColumnNames]; MatrixForm[ctRes] We can verify the result using Count: Count[titanicData, {"1st", _, "female", "died"}] (* 5 ) Count[titanicData, {"2nd", _, "male", "survived"}] ( 23 *) To split the cross-tensor across its first variable we can use this command: sctRes = Association@ MapThread[Rule[#1, Join[<\|"XTABTensor" -> #2\|>, Rest@Rest@ctRes]] &, {ctRes[[2]], # & /@ ctRes["XTABTensor"]}]; MatrixForm /@ sctRes Or we can call the more general function `CrossTensorateSplit` implemented in [1]: Map[MatrixForm /@ CrossTensorateSplit[ctRes, #] &, Rest@Keys[ctRes]] `CrossTensorateSplit` can also be called with one argument that is a variable name.This will produce a splitting function. For example, the above command can be re-written as : Map[MatrixForm /@ CrossTensorateSplit[#] @ ctRes &, Rest@Keys[ctRes]]

POSTED BY: Anton Antonov

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 9 years ago

Straightforward calling of MatrixForm At this point we might want to be able to call MatrixForm more directly for the output of CrossTabulate and FromRXTabsForm. One way to do this is to define an up-value for Association . Unprotect[Association]; MatrixForm[x_Association /; (KeyExistsQ[x, "XTABMatrix"] \|\| KeyExistsQ[x, "XTABTensor"])] ^:= (MatrixForm[#1, TableHeadings -> Rest[{##}]] & @@ x); Protect[Association]; Now we can do this: MatrixForm @ CrossTabulate[titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]] Remark 1: Because of this up-value definition for Association with MatrixForm we have the associations returned by `CrossTabulate` and `FromRXTabsForm` to have the key "XTABMatrix" instead of "Matrix", the former assumed to be much more rarely to be used than the latter. Remark 2: The up-value definition above can be put in the `init.m` file of your system.

POSTED BY: Anton Antonov

EDITORIAL BOARD

EDITORIAL BOARD, WOLFRAM

Posted 9 years ago

- another post of yours has been selected for the Staff Picks group, congratulations ! We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!

POSTED BY: EDITORIAL BOARD

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback

Contingency tables creation examples

Introduction

Code

Using Titanic data

Getting data

Data summary

Using CrossTabulate

Using R's xtabs (via RLink)

Relation to mosaic plots

Using larger data

Getting online retail invoices data

Data summary

Contingency tables

Country vs. StockCode

Country vs. Quarter

Uniform tables

References

Generalization : CrossTensorate

Straightforward calling of MatrixForm