Group Abstract Group Abstract

Message Boards Message Boards

2
|
3.2K Views
|
0 Replies
|
2 Total Likes
View groups...
Share
Share this post:
GROUPS:

An idiom for Dataset crosstabulation

At least in my experience, one often wants to take a Dataset and do some sort of crosstabulation. Here's an example drawn from real life. We have a Dataset (capital D) in which each row describes something (perhaps an insurance policy) such that one column contains a state of the United States in which the policy is sold and some other column contains another feature that has discrete values, perhaps whether the policy is sold to individuals, groups or both. And what we might well want to know is, for each state, how many policies are there that sell just to individuals, just to groups or to both. We want the output to be a Dataset that is an association of associations, in which second level association has precisely the same keys in the same order so that it displays nicely. Here's one way to do it. Assuredly there are other ways, perhaps some of them are better.

We start with a function that adds default values to an existing association and sorts the merged association in a consistent way.

AssociationPad[defaultAssociation_Association,]:=  (KeySort[Merge[{#, defaultAssociation}, First]] &)

We also need a function that takes a column of a dataset, finds all possible values, and creates an association mapping each of those possible choices to some default value.

 mergeRHS[ds_, column_, default_: 0] := 
  With[{choices = Normal[ds[Union, column]]}, 
   AssociationThread[choices, ConstantArray[default, Length[choices]]]]

We then use it as follows, where ds is some dataset that has "State" and "Market" as columns.

 ds[GroupBy[#State&],GroupBy[#Market&]/* AssociationPad[mergeRHS[ds, "Market"]],Length]

The result should generally be a new Dataset that is an association of associations in which the states are the keys of the outer associations and the possible values for Market are the keys of each of the nested associations. The values in the columns represent how many insurance policies there are for each possible market in each state.

Here's an example using the Titanic Dataset

 ds = ExampleData[{"Dataset", "Titanic"}]

Let's create a crosstabulation of class and gender

 ds[GroupBy["class"], 
  GroupBy[#sex &] /* AssociationPad[mergeRHS[ds, "sex"]], Length]

Here's one crosstabulating gender by survival that does not quite work (but I will show you how to fix it).

 ds[GroupBy["sex"], 
  GroupBy[#survived &] /* 
   AssociationPad[
    mergeRHS[ds, "survived"]] , Length]

The computation is done correctly but the display is not what one wants. This is, as I understand it, an issue with Mathematica, that can occur when all the keys of the inner association are not strings. Here's how you use KeyMap and ToString to fix it.

 ds[GroupBy["sex"], 
  GroupBy[#survived &] /* 
   AssociationPad[
    mergeRHS[ds, "survived"]] /* (KeyMap[ToString, #] &), Length]

Constructive criticism (or praise) welcome.

POSTED BY: Seth Chandler
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard