A primer on Association and Dataset

NOTE: all Wolfram Language code and data are available in the attached notebook at the end of the post.


For my class this fall, I developed a little primer on Association and Dataset that I think might be useful for many people. So, I'm sharing the attached notebook. It's somewhat about the concepts embedded inside these features. It's intended for people at a beginner-intermediate level of Mathematica/Wolfram Language programming, but might be of value even to some more advanced users who have not poked about the Dataset functionality.

The sections of the notebook are:

  1. The world before Associations and Datasets
  2. Datasets without Associations
  3. Enter the Association
  4. Creating a Dataset from a List of Associations
  5. Nice queries with Dataset
  6. Query
  7. Some Recipes

The world before Associations and Datasets

Here's an array of data. The data happens to represent the cabin class, age, gender, and survival of some of the passengers on the Titanic.

t = {{"1st", 29, "female", True}, {"1st", 30, "male", False}, {"1st", 
    58, "female", True}, {"1st", 52, "female", True}, {"1st", 21, 
    "female", True}, {"2nd", 54, "male", False}, {"2nd", 29, "female",
     False}, {"3rd", 42, "male", False}};

As it stands, our data is a List of Lists.

Head[t]

List

Head /@ t

{List, List, List, List, List, List, List, List}

Suppose I wanted to get the second and fifth rows of the data. This is how I could do it.

t[[{2, 5}]]

{{"1st", 30, "male", False}, {"1st", 21, "female", True}}

Suppose we want to group the passengers by gender and then compute the mean age. We could do this with the following pretty confusing code.

grouped = GatherBy[t, #[[3]] &];
justTheAges = grouped[[All, All, 2]];
Mean /@ justTheAges

{189/5, 42}

Or I could write it as a one-liner this way.

Map[Mean, GatherBy[t, #[[3]] &][[All, All, 2]]]

{189/5, 42}

But either way, realize that I have to remember that gender is the third column and that age is the second column. When there is a lot of data, this can get hard to remember.
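As a side sketch (not from the attached notebook), the same grouping can be written with the three-argument form of GroupBy, which at least labels each result with its group. It still depends on remembering the column positions. The names rows and meanAgeByGender are chosen just for this illustration; rows is a copy of t.

```wolfram
(* Copy of the sample data t from above, under a hypothetical name *)
rows = {{"1st", 29, "female", True}, {"1st", 30, "male", False},
   {"1st", 58, "female", True}, {"1st", 52, "female", True},
   {"1st", 21, "female", True}, {"2nd", 54, "male", False},
   {"2nd", 29, "female", False}, {"3rd", 42, "male", False}};

(* Group on column 3 (gender), keep only column 2 (age), reduce each group with Mean *)
meanAgeByGender = GroupBy[rows, (#[[3]] &) -> (#[[2]] &), Mean]
(* <|"female" -> 189/5, "male" -> 42|> *)
```

The result is an Association keyed by gender, so the reader no longer has to guess which number belongs to which group.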

Datasets without Associations

I could, if I wanted, convert this data into a Dataset. I do this below simply by wrapping Dataset about t. You see there is now some formatting about the data. But there are no column headers (because no one has told Dataset what to use). And there are no row headers, again because no one has told Dataset what to use.

t2 = Dataset[t]

The head of the expression has changed.

Head[t2]

Dataset

Now I can access the data in a different way.

Query[{2, 5}][t2]

Or, I can do this. Mathematica basically converts this expression into Query[{2,5}][t2]. The expression t2[{2,5}] is basically syntactic sugar.

t2[{2, 5}]

Digression: Using Query explicitly or using syntactic sugar

Why, by the way, would anyone use the longer form if Mathematica does the work for you? Suppose you want to store a Dataset operation -- perhaps a complex series of Dataset operations -- but you want it to work not just on a particular Dataset but on any Dataset (that is compatible). Here's how you could do it.

q = Query[{2, 5}]

Query[{2, 5}]

q[t2]

Now, let's create a scrambled version of the t2 Dataset by selecting its rows in a different order. (This particular selection leaves out row 6.)

t2Scrambled = t2[{1, 4, 8, 3, 2, 7, 5}]

We can now run the q operation on t2Scrambled. Notice that the output has changed even though the query has stayed the same.

q[t2Scrambled]

We can also generate Query objects with functions. Here's a trivial example. There are very few languages of which I am aware that have the ability to generate queries by using a function. The one other example is Julia.

makeASimpleQuery[n_] := Query[n]
makeASimpleQuery[{3, 4, 7}][t2]

MapReduce operations on Dataset objects

Now, if I want to know the mean ages of the genders I can use this code. This kind of grouping of data and then performing some sort of aggregation operation on the groups is sometimes known as a MapReduce. (I'm not a fan of the name, but it is widely used). It's also sometimes known as a rollup or an aggregation.

Query[GroupBy[#[[3]] &], Mean, #[[2]] &][t2]

Or this shorthand form in which the Query is constructed.

t2g = t2[GroupBy[#[[3]] &], Mean, #[[2]] &]

I think this is a little cleaner. But we still have to remember the numbers of the columns, which can be challenging.

By the way, just to emphasize how we can make this all functional, here's a function that creates a query that can run any operation (not just computing the mean) on the Dataset grouped by gender and then working on age.

genderOp[f_] := Query[GroupBy[#[[3]] &], f, #[[2]] &]
genderOp[Max][t2]

To test your understanding, see if you can find the minimum age for each class of passenger on the Titanic in our Dataset t2.

Query[GroupBy[#[[1]] &], Min, #[[2]] &][t2]

Enter the Association

Review of Association

If you feel comfortable with Associations, you can skip this section; otherwise read it carefully. Basically the key to understanding most Dataset operations is understanding Associations.

Construction of Associations

Now let's alter the data so that we don't have to remember which column holds which field. To do this we will create an Association. Here's an example called assoc1. Notice that we do so by creating a sequence of rules and then wrapping it in an Association head. Notice that the standard output does not preserve the word "Association" as the head but, just as List is outputted as stuff inside curly braces, Association is outputted as stuff inside these funky "<|" and "|>" glyphs.

assoc1 = Association["class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True]

<|"class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True|>

I could equivalently have created a list of rules rather than a sequence. Mathematica would basically unwrap the List and create a sequence.

assoc1L = Association[{"class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True}]

<|"class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True|>

We can use AssociationThread to create Associations in a different way. The first argument is the list of things that go on the left hand side of the Rules -- the "keys" -- and the second argument is the list of things that go on the right hand side of the Rules -- the "values".

assoc1T = AssociationThread[{"class", "age", "gender", "survived"}, {"1st", 29, "female", True}]

<|"class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True|>
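As a small aside, AssociationThread also accepts a single rule between the two lists. The sketch below checks that both spellings produce the same Association. The variable names keys, row, fromTwoLists, and fromOneRule are hypothetical and used only here.

```wolfram
keys = {"class", "age", "gender", "survived"};
row = {"1st", 29, "female", True};

(* Two-argument form, as used above *)
fromTwoLists = AssociationThread[keys, row];

(* Single-rule form: list of keys -> list of values *)
fromOneRule = AssociationThread[keys -> row];

fromTwoLists === fromOneRule
(* True *)
```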

Now let's use the AssociationThread function to define a function that converts one row of our original data into an Association.

convertListToAssociation = 
 list \[Function] 
  AssociationThread[{"class", "age", "gender", "survived"}, list]

Function[list, AssociationThread[{"class", "age", "gender", "survived"}, list]]

I start with t and Map the convertListToAssociation function over the rows of the data. I end up with a list of Associations.

t3 = Map[convertListToAssociation, t]

Keys and Values

Associations have keys and values. These data structures are used in other computer languages but known by different names: Python and Julia call them dictionaries. Go and Scala call them maps. Perl and Ruby call them hashes. Java calls it a HashMap. And JavaScript calls it an object. But they all work pretty similarly. Anyway, the keys of an Association are the things on the left hand side of the Rules.

Keys[assoc1]

{"class", "age", "gender", "survived"}

And the values of an Association are the things on the right hand side of the Rules.

Values[assoc1]

{"1st", 29, "female", True}

That's about all there is to it. Except for one thing. Take a look at the input and output that follows.

assoc2 = Association["a" -> 3, "b" -> 4, "a" -> 5]

<|"a" -> 5, "b" -> 4|>

You can't have duplicate keys in an Association. So, when Mathematica confronts duplicate keys, it keeps the value from the last rule it saw for that key. You might think this is a minor point, but it is actually very important in coding. We will see why soon.

Nested Associations

A funny thing happens if you nest an Association inside another Association.

Association[assoc1, assoc2]

<|"class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True, "a" -> 5, "b" -> 4|>

You end up with a single un-nested (flat) association. That's a little unusual for Mathematica, but we can exploit this flattening as a way of adding elements to an Association.

Association[Association["dances" -> False], assoc1]

<|"dances" -> False, "class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True|>

Or, here's a function that exploits the flattening to add elements to an Association.

addstuff = Association[#, "dances" -> False, "sings" -> True] &

Association[#1, "dances" -> False, "sings" -> True] &

addstuff[assoc1]

<|"class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True, "dances" -> False, "sings" -> True|>
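Flattening combined with the "last value wins" rule for duplicate keys also gives a way to change an existing value. The following is a minimal sketch; person is a hypothetical copy of assoc1 and older is a name invented for the example.

```wolfram
(* Hypothetical copy of assoc1 *)
person = <|"class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True|>;

(* The later rule for "age" overrides the value that came from person *)
older = Association[person, "age" -> 30];

older["age"]
(* 30 *)

(* person itself is unchanged; Association built a new expression *)
person["age"]
(* 29 *)
```

The recipes at the end of the post use the same idea inside Query to modify columns row by row.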

Extracting Values from Associations

Just as the values contained in a List can be accessed by using the Part function, the values contained in an Association can likewise be accessed. Suppose, for example that I wanted to compute double the age of the person in assoc1.

It turns out there are a lot of ways of doing this. The first is to treat the Association as a list except that the indices, instead of being integers, are the "keys" that are on the left hand side of the rules.

2*Part[assoc1, "age"]

58

2*assoc1[["age"]]

58

A second way is to use Query. We can wrap the "key" in the head Key just to make sure Mathematica understands that the thing is a Key.

2*Query[Key["age"]][assoc1]

58

Usually we can omit the Key and everything works fine.

2*Query["age"][assoc1]

58

A third way is to write a function that has an association as its argument.

af = Function[Slot["age"]]

#age &

Now look what we can do.

2*Query[af][assoc1]

58

We can shorten this approach by using a simpler syntax for a function.

2*Query[#age &][assoc1]

58

Note, though, that the following still will not work. Basically, Mathematica is confused. It thinks the function itself is the key.

2*assoc1[af]

2 Missing["KeyAbsent", #age &]

But here's a simple workaround. For very simple functions, I can just use the name of the key.

2*assoc1["age"]

58
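One more access pattern that may be worth knowing is what happens when a key is absent. The sketch below contrasts direct lookup, which returns a Missing object, with Lookup, which accepts a default value. The name person and the key "height" are hypothetical.

```wolfram
(* Hypothetical copy of assoc1 *)
person = <|"class" -> "1st", "age" -> 29, "gender" -> "female", "survived" -> True|>;

(* Asking for an absent key does not raise an error *)
person["height"]
(* Missing["KeyAbsent", "height"] *)

(* Lookup lets you supply a default instead *)
Lookup[person, "height", "unknown"]
(* "unknown" *)

(* Test for a key before using it *)
KeyExistsQ[person, "age"]
(* True *)
```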

A Note on Slot Arguments

And please pay attention to this: sometimes the Mathematica parser gets confused when it confronts a "slot argument" written as #something. If you see this happening, write it as Slot["something"].

Slot["iamaslot"] === #iamaslot

True

Here's another problem. What if the key in the association has spaces or non-standard characters in it? Any of these, for example, are perfectly fine keys: the string "I have a lot of spaces in me", the string "I_have_underscores", the symbol True, the integer 43. But if we try to denote those keys by putting a hash in front of them, it will lead to confusion and problems.

problemAssociation = Association["I have a lot of spaces in me" -> 1, "I_have_underscores" -> 2, True -> 3, 43 -> 4]

<|"I have a lot of spaces in me" -> 1, "I_have_underscores" -> 2, True -> 3, 43 -> 4|>

{Query[#I have a lot of spaces in me &][problemAssociation], 
 Query[#I _have _underscores &][problemAssociation]}

Here's a solution.

{Query[Slot["I have a lot of spaces in me"] &][problemAssociation], 
 Query[Slot["I_have_underscores"] &][problemAssociation]}

{1, 2}

Using True or an integer as a key causes a similar problem. The slot forms below do not retrieve the values.

{Query[#True &][problemAssociation], Query[#43 &][problemAssociation]}

The solution is to wrap such keys in Key.

{Query[Key[True]][problemAssociation], 
 Query[Key[43]][problemAssociation]}

{3, 4}

Working with Associations and Lists of Associations

Here's something we can do with the data in the form of an Association. I could ask for the gender of the person in the third row as follows. Notice I did not have to remember that "gender" was generally in the third position.

t3[[3]][["gender"]]

"female"

So, even if I scramble the order of the columns (and of the keys along with them), I can still use the same code.

t3Scrambled = Map[AssociationThread[{"gender", "survived", "class", "age"}, #] &, t[[All, {3, 4, 1, 2}]]]

t3Scrambled[[3]][["gender"]]

"female"

I could also group the people according to their cabin class. Here I use Query on a list of Associations.

Query[GroupBy[#class &]][t3]

Again, the following code, which does not explicitly use Query, won't work. Basically, nothing has told Mathematica to translate t3[stuff___] → Query[stuff][t3]. If t3 had a head of Dataset, Mathematica would know to make the translation.

t3[GroupBy[#class &]]

I can also get certain values for all the Associations in a list of Associations.

Query[All, #age &][t3]

{29, 30, 58, 52, 21, 54, 29, 42}

I can also map a function onto the result. I don't have to go outside the Query form to do so.

Query[f, #age &][t3]

f[{29, 30, 58, 52, 21, 54, 29, 42}]

Or, without exiting the Query form, I can map a function onto each element of the result.

Query[Map[f], #age &][t3]

{f[29], f[30], f[58], f[52], f[21], f[54], f[29], f[42]}

I could also do the same thing as follows.

Query[All, #age &, f][t3]

{f[29], f[30], f[58], f[52], f[21], f[54], f[29], f[42]}
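To make the abstract f concrete, here is a short sketch that runs a few real operations through the same Query forms on a list of Associations. The names rows, keys, and people are hypothetical; people is built the same way as t3.

```wolfram
(* Rebuild the list of Associations (same content as t3) under hypothetical names *)
rows = {{"1st", 29, "female", True}, {"1st", 30, "male", False},
   {"1st", 58, "female", True}, {"1st", 52, "female", True},
   {"1st", 21, "female", True}, {"2nd", 54, "male", False},
   {"2nd", 29, "female", False}, {"3rd", 42, "male", False}};
keys = {"class", "age", "gender", "survived"};
people = Map[AssociationThread[keys, #] &, rows];

(* Extract "age" from every row, then total the resulting list *)
Query[Total, "age"][people]
(* 315 *)

(* Same shape with a slot function and a different aggregator *)
Query[Max, #age &][people]
(* 58 *)

(* Keep only rows with age above 40, then extract "class" from each *)
Query[Select[#age > 40 &], "class"][people]
(* {"1st", "1st", "2nd", "3rd"} *)
```

In the last query, Select filters the rows before the second argument is applied to each surviving row.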

Creating a Dataset from a List of Associations

To get full use out of Query and to permit syntactic shorthands, we need for Mathematica to understand that the list of Associations is in fact a Dataset. Here's all it takes.

d3 = Dataset[t3]

We can recover our original list of associations by use of the Normal command.

t3 === Normal[d3]

True

With the data inside a Dataset object we now have pretty formatting. But we have more.

We can still do this. We get the same result but in a more attractive form.

d3g = Query[GroupBy[#class &]][d3]

But now this shorthand works too.

d3g = d3[GroupBy[#class &]]

And compare these two elements of code. When the data is in the form of a dataset, Mathematica understands that the stuff in the brackets is not intended as a key but rather is intended to be transformed into a Query.

{Query[#age &][t3[[1]]], d3[[1]][#age &]}

{29, 29}

A Dataset that is an Association of Associations

Let's look under the hood of d3g.

d3gn = Normal[d3g]

Note: if you really want to look under the hood of a Dataset, ask to see the Dataset in FullForm. You can also get more information by running the undocumented package Dataset`, but this is definitely NOT recommended for the non-advanced user.

What we see is an Association in which each of the values is itself a list of Associations.

We can map a function over d3gn.

Map[f, d3gn]

I can of course do the mapping within the Query construct.

Query[All, f][d3gn]

If I try syntactic sugar, it doesn't work because d3gn is not a Dataset.

d3gn[All, f]

Missing["KeyAbsent", All]

But if I use the Dataset version, it does work. (The first line may be displayed as an ellipsis depending on your operating system and display, but if you look under the hood it looks just like the values for 2nd and 3rd. I have no idea why an ellipsis is being inserted.)

d3g[All, f]

A Dataset that just has a single Association inside

We can also have a Dataset that just has a single Association inside. Mathematica presents the information with the keys and values displayed vertically.

Dataset[d3[[1]]]

In theory, we could have a Dataset that just had a single number inside it.

Dataset[6]

Nice queries with Dataset

Now I can construct a query that takes a dataset and groups it by the gender column. It then takes each grouping and applies the Mean function to at least part of it. What part? The "age" column part. Notice that I no longer have to remember that gender is the third column and age is the second column.

qd = Query[GroupBy[#gender &], Mean, #age &]

Query[GroupBy[#gender &], Mean, #age &]

Now I can run this query on d3.

qd[d3]

We can now learn a lot about Query. So long as our data is in the form of a Dataset we can write the query as either a formal Query or use syntactic sugar.

Query

A major part of working with data is to understand Query. Let's start with a completely abstract Query, that we will call q1.

q1 = Query[f];

Now let's run q1 on t3.

q1[t3]

We end up with a list of Associations that has f wrapped around it at the highest level. It's the same as if I wrote the following code.

f[t3] === q1[t3]

True

Now, let's write a Query that applies the function g at the top level of the list of associations and the function f at the second level, i.e. to each of the rows. Why does it work at the second level? Because it's the second argument to Query.

q2 = Query[g, f];
q2[t3]

The result is the same as if I mapped f onto t3 at its first level and then wrapped g around it.

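g[Map[f, t3, {1}]] === q2[t3]

A Query can also transform selected columns in every row. For example, the following query replaces the class and gender values in each row with their first characters. Let's build it up step by step.

Query[All, MapAt[StringTake[#, 1] &, #, {{"class"}, {"gender"}}] &][d3]

Here's a function firstchar that takes the first character in a string.

firstchar = StringTake[#, 1] &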

StringTake[#1, 1] &

Now, let's construct a query cg1 that applies firstchar to the class and gender keys in each row.

cg1 = Query[All, 
  a \[Function] MapAt[firstchar, a, {{"class"}, {"gender"}}]]

Query[All, Function[a, MapAt[firstchar, a, {{"class"}, {"gender"}}]]]

We apply cg1 to our little dataset d3.

cg1[d3]

What if we want to apply the same function to every element of the Dataset? We just apply it at the lowest level. Here's one way.

Query[Map[f, #, {-1}] &][d3]

We can also combine it with column-wise and entirety-wise operations. For reasons that are not clear, Mathematica can't understand this as a Dataset and returns the Normal form.

Query[(Map[f, #, {-1}] &) /* entiretywise, columnwise][d3]

Here's how we could actually use a multilevel Query.

Suppose we want to write a function that computes the fraction of the people in this little dataset that survived. The first step is simply going to be to extract the survival value and convert it to 1 if True and 0 otherwise. There's a built in function Boole that does this.

{Boole[True], Boole[False]}

{1, 0}

q3 = Query[something, 
  assoc \[Function] assoc["survived"] /. {True -> 1, _ -> 0}]

Query[something, Function[assoc, assoc["survived"] /. {True -> 1, _ -> 0}]]

q3[t3]

something[{1, 0, 1, 1, 1, 0, 0, 0}]

So, now we have something wrapping a list of 1s and 0s. By making something the Mean function, we can achieve our result.

q4 = Query[Mean, Boole[#survived] &]

Query[Mean, Boole[#survived] &]

q4[d3]

1/2

We can also examine survival by gender. Notice that Query is a little like Association: it gets automatically flattened.

Query[GroupBy[#gender &], q4][t3]

<|"female" -> 4/5, "male" -> 0|>

If the data is held in a Dataset, we can also write the final step as follows.

d3[GroupBy[#gender &], q4]

Notice that even if we omit the "Query", this code works. Mathematica just figures out that you meant Query.

The code immediately above is in the form we typically see and often use.
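The same group-then-aggregate shape carries over to other keys. As a hedged sketch on a list of Associations (people is a hypothetical name for data built like t3), here are the number of passengers per class and the survival fraction per class.

```wolfram
(* Rebuild the list of Associations (same content as t3) under hypothetical names *)
rows = {{"1st", 29, "female", True}, {"1st", 30, "male", False},
   {"1st", 58, "female", True}, {"1st", 52, "female", True},
   {"1st", 21, "female", True}, {"2nd", 54, "male", False},
   {"2nd", 29, "female", False}, {"3rd", 42, "male", False}};
people = Map[AssociationThread[{"class", "age", "gender", "survived"}, #] &, rows];

(* Extract "class" from every row, then count the occurrences *)
classCounts = Query[Counts, "class"][people]
(* <|"1st" -> 5, "2nd" -> 2, "3rd" -> 1|> *)

(* Group by class, convert survival to 1 or 0 per row, average within each group *)
survivalByClass = Query[GroupBy[#class &], Mean, Boole[#survived] &][people]
(* <|"1st" -> 4/5, "2nd" -> 0, "3rd" -> 0|> *)
```

With only eight rows these fractions say nothing about the real Titanic; they only illustrate the mechanics.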

Some Recipes

titanic = ExampleData[{"Dataset", "Titanic"}]

How to add a value to the Dataset based on values external to the existing columns.

Here's some additional data. Notice that the data is the same length as the titanic dataset.

stuffToBeAdded = 
 Table[Association["id" -> i, 
   "weight" -> RandomInteger[{80, 200}]], {i, Length[titanic]}]

We use Join at level 2.

augmentedTitanic = Join[titanic, stuffToBeAdded, 2]

How to add a column to a Dataset based on values in the existing columns and to do so row-wise

Notice that the query below does NOT change the value of the titanic dataset. To change the value of the titanic dataset, one would need to set titanic to the result of the computation. Remember, Mathematica generally does not have side effects or do modifications in place.

Query[All, Association[#, "classsex" -> {#class, #sex}] &][titanic]

We can add multiple columns this way.

Query[All, 
  Association[#, "classsex" -> {#class, #sex}, 
    "agesqrt" -> Sqrt[#age]] &][titanic]

How to change the value of an existing column row-wise

Age everyone by one year.

Query[All, Association[#, "age" -> #age + 1] &][titanic]

How to change the value of columns selectively.

Query[All, 
  Association[#, 
    "age" -> If[#sex === "male", #age + 1, #age]] &][titanic]

How to create a new column based on some aggregate operator applied to another column.

With[{meanAge = Query[Mean, #age &][titanic]}, 
  Query[All, 
   Association[#, "ageDeviation" -> #age - meanAge] &]][titanic]

Can you develop your own recipes?
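As a starting point, here are three more small recipes in the same spirit, sketched on the eight-row sample rather than the full titanic Dataset so that they can be run on their own. They are illustrations, not part of the attached notebook, and the names people, withoutSurvived, renamed, and survivors are hypothetical.

```wolfram
(* Rebuild the list of Associations (same content as t3) under hypothetical names *)
rows = {{"1st", 29, "female", True}, {"1st", 30, "male", False},
   {"1st", 58, "female", True}, {"1st", 52, "female", True},
   {"1st", 21, "female", True}, {"2nd", 54, "male", False},
   {"2nd", 29, "female", False}, {"3rd", 42, "male", False}};
people = Map[AssociationThread[{"class", "age", "gender", "survived"}, #] &, rows];

(* Recipe: drop a column from every row *)
withoutSurvived = Query[All, KeyDrop[{"survived"}]][people];
First[withoutSurvived]
(* <|"class" -> "1st", "age" -> 29, "gender" -> "female"|> *)

(* Recipe: rename a column by rewriting the keys of every row *)
renamed = Query[All, KeyMap[Replace["gender" -> "sex"]]][people];
Keys[First[renamed]]
(* {"class", "age", "sex", "survived"} *)

(* Recipe: keep only the survivors and only two of the columns *)
survivors = Query[Select[#survived &], {"class", "age"}][people];
Length[survivors]
(* 4 *)
```

Wrapping people in Dataset should allow the same operators to be used in the shorthand form shown earlier.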

POSTED BY: Seth Chandler
27 Replies
Posted 3 years ago

Finally, thank you.

Something to consider, having a button to go to top of page and at end of the OP instead of scrolling a lot or also putting the attachment link also at end of the last post. When one is visually impaired navigation sometimes can be a problem.

POSTED BY: Andrew Meit
Posted 3 years ago

So this post gets bumped yet again; but no notebook yet. Seth, please restore your notebook. Thank you. And yet also no book from you. Or is this post what is in the notebook and so no need for the notebook? Frustrated and confused.

POSTED BY: Andrew Meit

Andrew, you can find the notebook currently attached to the main post.

POSTED BY: Ahmed Elbanna
Posted 3 years ago

Steven, thanks.

If there are indeed documenters out there reading this, here's another one. Dataset has a kind of cool behavior showing the item path just below the display, and it really needs to be paired with an option for PathDisplayFunction or equivalent.

Allan

POSTED BY: A Cooper
Posted 3 years ago

Seth, thank you for this wonderful primer.

As you point out,

We can also have a Dataset that just has a single Association inside. Mathematica presents the information with the keys and values displayed vertically.

Is this considered a feature or a bug? Super annoying to have computers doing random unexpected stuff. No option or setting to control this behavior. Or am I missing something?

Thanks!

Allan

POSTED BY: A Cooper

Dataset seems to be evolving rapidly. In V 12.0, it had no Options, in V 12.3 it has over a dozen. I suspect that would-be documenters are having trouble keeping up.

POSTED BY: Stephen Wandzura
Posted 3 years ago

Why is the notebook missing. Are you allowed to repost the notebook? Noticed no book forthcoming yet from him. Any other related primer?

POSTED BY: Andrew Meit
Posted 3 years ago

The notebook needs to be restored; please. Why is this taking so long to get restored??

POSTED BY: Andrew Meit
Posted 3 years ago

I agree! Where is the notebook?

POSTED BY: Douglas Kubler

Thanks to @Seth Chandler, author of this post, the notebook is restored again. You can find it attached to the main post and to this message too.

POSTED BY: Ahmed Elbanna
Posted 3 years ago

In most cases, queries with keys using (key) names or slots will give the identical results.

However, in some cases the way Mathematica handles names or slots can lead to different results.

I ran into this example:

Query[All, All, Delete@"class"]@
 GroupBy[#class &]@ExampleData[{"Dataset", "Titanic"}]

Which drops/deletes the class column as so:

If we try this using the Slot notation we get a different result:

Query[All, All, Delete@#class &]@
 GroupBy[#class &]@ExampleData[{"Dataset", "Titanic"}]

I think there may be a semantic difference between the two notations. I suspect the name notation refers to the whole column (position), whereas the slot notation refers to the items in the Dataset under that name. In most cases, it will lead to an equivalent result.

POSTED BY: Dave Middleton
Posted 3 years ago

Thank you for sharing your Dataset Primer.

Initially, I used Datasets by trial and error. The Mathematica Reference Documentation is a great resource, but this post shows again that we may need a more extensive, hands-on tutorial.

Your Primer and numerous resources on StackExchange or some books helped me on my way with Datasets.

Cheers,

Dave

POSTED BY: Updating Name

This is a fantastic resource, many thanks.

I have a question concerning Datasets. When comparing associations to lists, I have found that the efficiency gain of using associations instead of lists can be very, very substantial.

Is there an analogous strong incentive for using Datasets instead of, for example, lists of lists or associations of associations?

Thanks,

Francisco

Is the notebook file attached? I didn't see it?

POSTED BY: Stephen Wandzura

Great resource, thanks!

Hi Alan, thanks for the offer. I'd love to see those sample chapter preprints. My email address is ruben dot garcia at jic dot ac dot id

Often, it's not necessary to use Slot for positional dereference, eg Query[2,f,3] evaluates the same as Query[#[[2]]&,f,#[[3]]&]. Similarly, Span works as well.

Ps, for those interested, I'm close to finishing my book Functional Data Workflow which is based on real-world methods and data collected as part of large time-motion/UX/EHR studies at two large healthcare organizations. Email if you'd like to see sample chapter preprints.

POSTED BY: Alan Calvitti
Posted 7 years ago

Excellent! thanks for sharing it!

POSTED BY: Andres Aldana

It would be great if this -- or something with similar depth -- made it into the official Wolfram Mathematica documentation.

Definitely. That would be really welcome.

POSTED BY: Arno Bosse

Be patient! A long book on the topic is coming. Before end of 2019.

POSTED BY: Seth Chandler

Hi Seth, Any updates on the book? I know such things take longer than expected, but it will be very useful. WCC

POSTED BY: W. Craig Carter
Posted 3 years ago

Hi ,Seth,

Is your book published?

POSTED BY: Pred Liu

Hi. Did you get your book published?

It was announced here in this community; the link to the book is: https://www.wolfram.com/language/query-getting-information-from-data-with-the-wolfram-language/

POSTED BY: Dave Middleton

Seth is presenting a Wolfram-U webinar on the book.

POSTED BY: Rohit Namjoshi

Congratulations! This post is now a Staff Pick as distinguished on your profile! Thank you, keep it coming!

POSTED BY: EDITORIAL BOARD