# Contingency tables creation examples

Posted 5 years ago
8886 Views
|
4 Replies
|
12 Total Likes
|

## Introduction

In statistics contingency tables are matrices used to show the co-occurrence of variable values of multi-dimensional data. They are fundamental in many types of research. This document shows how to use several functions implemented in Mathematica for the construction of contingency tables.

### Code

In this document we are going to use the implementations in the package MathematicaForPredictionUtilities.m from MathematicaForPrediction at GitHub, [1].

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]


The implementation of CrossTabulate in [1] is based on Tally, GatherBy, and SparseArray. The implementation of xtabsViaRLink in [1] is based on R's function xtabs called via RLink.

Other package used in this document are [2] and [4] imported with these commands:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/RSparseMatrix.m"]


For a different approach to implementing cross-tabulation than those taken in [1] see the Stack Overflow answer http://stackoverflow.com/a/8101951 by Mr.Wizard. See my answer in the Community question "Two-way contingency table" using the data from that SO post.

## Using Titanic data

### Getting data

titanicData =
Flatten@*List @@@ ExampleData[{"MachineLearning", "Titanic"}, "Data"];
titanicData = DeleteCases[titanicData, {___, _Missing, ___}];

titanicColumnNames =
Flatten@*List @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
aTitanicColumnNames =


Note that we have removed the records with missing data (for simpler exposition).

### Data summary

Dimensions[titanicData]
(* {1046, 4} *)

RecordsSummary[titanicData, titanicColumnNames]


### Using CrossTabulate

Assume that we want to group the people according to their passenger class and survival and we want to find the average age for each group.

We can do that by

1. finding the counts contingency table $C$ for the variables "passenger class" and "passenger survival",

2. finding the age contingency table $A$ for the same variables, and

3. do the element-wise division $\frac{A}{C}$.

ctCounts =
CrossTabulate[
titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts


ctTotalAge =
CrossTabulate[
titanicData[[All,
aTitanicColumnNames /@ {"passenger class", "passenger survival",
"passenger age"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctTotalAge


MatrixForm[
ctTotalAge[[1]]/
Normal[ctCounts[[1]]],


(We have to make the sparse array ctCounts a regular array because otherwise we get warnings for division by zero because of the sparse array's default value.)

Let us repeat the steps above by dividing the passengers before-hand according to their sex.

Association@
Map[
(mCount =
CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]];
mAge = CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival", "passenger age"}]]];
#[[1,  aTitanicColumnNames["passenger sex"]]] ->
GatherBy[titanicData, #[[aTitanicColumnNames["passenger sex"]]] &]]


### Using R's xtabs (via RLink)

The alternative of CrossTabulate is xtabsViaRLink that is uses R's function xtabs via RLink.

Needs["RLink"]
InstallR[]

ctCounts =
FromRXTabsForm@
titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]],
{"passenger.sex", "passenger.survival"},
" ~ passenger.sex + passenger.survival"];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts


### Relation to mosaic plots

The graphical visualization of a dataset with mosaic plots, [2,3], is similar in spirit to contingency tables. Compare the following mosaic plot with the contingency table in the last section.

MosaicPlot[
titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]] ]


## Using larger data

Let us consider an example with larger data that has larger number of unique values in its columns.

### Getting online retail invoices data

The following dataset is taken from [6].

data = Import[ "/Volumes/WhiteSlimSeagate/Datasets/UCI Online Retail Data Set/Online Retail.csv"];
columnNames = First[data];
data = Rest[data];



### Data summary

We have $\approx 66000$ rows and $8$ columns:

Dimensions[data]
(* {65499, 8} *)


Here is a summary of the columns:

Magnify[#, 0.75] &@RecordsSummary[data, columnNames]


### Contingency tables

#### Country vs. StockCode

There is no one-to-one correspondence between the values of the column "Description" and the column "StockCode" which can be seen with this command:

MinMax@Map[Length@*Union,
GatherBy[data[[All, aColumnNames /@ {"Description", "StockCode"}]], First]]
(* {1,144} *)


The way in which the column "StockCode" was ingested made it have multiple types for its values:

Tally[NumberQ /@ data[[All, aColumnNames["StockCode"]]]]
(* {{False,9009},{True,56490}} *)


So let us convert it to all strings:

data[[All, aColumnNames["StockCode"]]] =
ToString /@ data[[All, aColumnNames["StockCode"]]];


Here we find the contingency table for "Country" and "StockCode" over "Quantity" using CrossTabulate:

AbsoluteTiming[
ctRes = CrossTabulate[
data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]]];
]
(* {0.256339,Null} *)


Here we find the contingency table for "Country" and "StockCode" over "Quantity" using xtabsViaRLink:

AbsoluteTiming[
data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]],
{"Country", "StockCode", "Quantity"},
"Quantity ~ Country + StockCode"];
ctRRes = FromRXTabsForm[rres];
]
(* {0.843621,Null} *)


Both functions produce the same result:

ctRRes["matrix"] == N@ctRRes[[1]]
(* True *)


Note that xtabsViaRLink is slower but still fairly quick.

Here we plot the contingency table using MatrixPlot :

MatrixPlot[ctRRes["matrix"], AspectRatio -> 1/1.5,
FrameTicks -> {{#, #} &@ Table[{i, ctRRes["rownames"][[i]]}, {i, Length[ctRRes["rownames"]]}],
{Automatic, Automatic}}, ImageSize -> 550]


#### Country vs. Quarter

Let us extend the data with columns that have months and quarters corresponding to the invoice dates.

The following commands computing date objects and extracting month and quarter values from them are too slow.

(*AbsoluteTiming[dobjs=DateObject[{#,{"Month","/","Day","/","Year"," \
","Hour",":","Minute"}}]&/@data[[All,aColumnNames["InvoiceDate"]]];
]*)
(* {30.2595,Null} *)

(*AbsoluteTiming[
dvals=DateValue[dobjs,{"MonthName","QuarterNameShort"}];
]*)
(* {91.1732,Null} *)


dvals = StringSplit[#, {"/", " ", ":"}] & /@
data[[All,
aColumnNames["InvoiceDate"]]];


This summary shows that the second value in the dates is day of month, and the first value is most likely month.

Magnify[#, 0.75] &@ RecordsSummary[dvals[[All, 1 ;; 3]], "MaxTallies" -> 16]


These commands extend the data and the corresponding column-name-to-index association.

ms = DateValue[Table[DateObject[{2016, i, 1}], {i, 12}], "MonthName"];
dvals = Map[{ms[[#]], "Q" <> ToString[Quotient[#, 4] + 1]} &, ToExpression @ dvals[[All, 1]]];
dataM = MapThread[Join[#1, #2] &, {data, dvals}];
aColumnNamesM =
Join[aColumnNames, <|"MonthName" -> (Length[aColumnNames] + 1), "QuarterNameShort" -> (Length[aColumnNames] + 2)|>];
(* {0.054877,Null} *)


Here is the contingency table for "Country" vs "QuarterNameShort" over "Quantity".

ctRes = CrossTabulate[ dataM[[All, aColumnNamesM /@ {"Country", "QuarterNameShort", "Quantity"}]]];
Magnify[#, 0.75] &@ MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctRes


## Uniform tables

Often when making contingency tables over subsets of the data we obtain contingency tables with different rows and columns. For various reasons (programming, esthetics, comprehension) it is better to have the tables with the same rows and columns.

Here is an example of non-uniform contingency tables derived from the online retail data of the previous section. We split the data over the countries and find contingency tables of "MonthName" vs "QuarterNameShort" over "Quantity".

tbs = Association @
Map[
(xtab = CrossTabulate[#[[All, aColumnNamesM /@ {"MonthName", "QuarterNameShort", "Quantity"}]]];
#[[1, aColumnNamesM["Country"]]] -> xtab) &,
GatherBy[ dataM, #[[aColumnNamesM[ "Country"]]] &]];

Magnify[#, 0.75] &@
Map[MatrixForm[#["XTABMatrix"], TableHeadings -> (# /@ {"RowNames", "ColumnNames"})] &, tbs](*[[{1,12,-1}]]*)


Using the object RSparseMatrix, see [4,5], we can impose row and column names on each table.

First we convert the contingency tables into RSparseMatrix objects:

tbs2 = Map[ ToRSparseMatrix[#["XTABMatrix"], "RowNames" -> #["RowNames"], "ColumnNames" -> #["ColumnNames"]] &, tbs];


And then we impose the desired row and column names:

tbs2 = Map[ ImposeColumnNames[ ImposeRowNames[#, {"January", "December"}], {"Q1", "Q2", "Q3", "Q4"}] &, tbs2];
Magnify[#, 0.75] &@(MatrixForm /@ tbs2)


## References

[1] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[2] Anton Antonov, Mosaic plot for data visualization implementation in Mathematica, (2014), MathematicaForPrediction at GitHub, package MosaicPlot.m.

[6] Daqing Chen, Online Retail Data Set, (2015), UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Online+Retail .

4 Replies
Sort By:
Posted 5 years ago
 - another post of yours has been selected for the Staff Picks group, congratulations !We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!
Posted 5 years ago

## Straightforward calling of MatrixForm

At this point we might want to be able to call MatrixForm more directly for the output of CrossTabulate and FromRXTabsForm. One way to do this is to define an up-value for Association .

Unprotect[Association];
MatrixForm[x_Association /; (KeyExistsQ[x, "XTABMatrix"] || KeyExistsQ[x, "XTABTensor"])] ^:= (MatrixForm[#1, TableHeadings -> Rest[{##}]] & @@ x);
Protect[Association];


Now we can do this:

MatrixForm @
CrossTabulate[titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]


Remark 1: Because of this up-value definition for Association with MatrixForm we have the associations returned by CrossTabulate and FromRXTabsForm to have the key "XTABMatrix" instead of "Matrix", the former assumed to be much more rarely to be used than the latter.

Remark 2: The up-value definition above can be put in the init.m file of your system.

Posted 5 years ago

## Generalization : CrossTensorate

A generalization of CrossTabulate is the function CrossTensorate implemented in [1] that takes a "formula" argument similar to R's xtabs.

This finds number of people of different sub-groups of Titanic data:

ctRes = CrossTensorate[Count == "passenger survival" + "passenger sex" + "passenger class", titanicData, aTitanicColumnNames];
MatrixForm[ctRes]


We can verify the result using Count:

Count[titanicData, {"1st", _, "female", "died"}]
(* 5 *)

Count[titanicData, {"2nd", _, "male", "survived"}]
(* 23 *)


To split the cross-tensor across its first variable we can use this command:

sctRes = Association@
MapThread[Rule[#1, Join[<|"XTABTensor" -> #2|>, Rest@Rest@ctRes]] &, {ctRes[[2]], # & /@ ctRes["XTABTensor"]}];
MatrixForm /@ sctRes


Or we can call the more general function CrossTensorateSplit implemented in [1]:

Map[MatrixForm /@ CrossTensorateSplit[ctRes, #] &, Rest@Keys[ctRes]]


CrossTensorateSplit can also be called with one argument that is a variable name.This will produce a splitting function. For example, the above command can be re-written as :

Map[MatrixForm /@ CrossTensorateSplit[#] @ ctRes &, Rest@Keys[ctRes]]

 There is a function in Wolfram Function Repository for doing cross tabulation: CrossTabulate`.