Message Boards Message Boards

9
|
14256 Views
|
1 Reply
|
9 Total Likes
View groups...
Share
Share this post:

How to make use of Associations

NOTE: Please see the original version of this post HERE where original questions that motivated this post were formulated. Cross-posted here per suggestion of Vitaliy Kaurov. Also note a related discussion on the interplay between associations and mutability, which, in my view, nicely complements this one.


I. General

I will first try to briefly answer the questions, and then illustrate this with a small but practical application.

1.Speed of insertion / deletion

Associations are based on so called Hash Array Mapped Trie persistent data structure. One can think of this as a nested hash table, but it is more than that, because it has the following properties:

  • Immutable
  • Insertion and deletion are O(log N)

One may ask how does immutability even allow for such performance characteristics, and this is where it becomes non-trivial. Basically, it is this combination which is very important for our case, since Mathematica favors immutable code.

Runtime efficiency

In practical terms, insertion and deletion are pretty fast, and you can also insert and delete key-value pairs in bulks. This latter capability is important speed booster. Here is a short example:

store = Association[{}];
Do[AppendTo[store, i -> i + 1], {i, 100000}]; // AbsoluteTiming
Length[store]

(* {0.497726, Null} *)

(* 100000 *)

Now constructing in a bulk:

store = 
   Association[
      Thread[Range[100000] -> Range[100000] + 1]
   ]; // AbsoluteTiming
Length[store]

(* {0.130438, Null} *)

(* 100000 *)

This small benchmark is important, since it shows that on one hand, insertion and deletion are fast enough that the runtime is dominated by the top-level evaluator (in the case of per-element insertion), and on the other hand, the difference is still not the order of magnitude or more - so insertion / deletion time is comparable with the top-level iteration overhead (although half - order of magnitude smaller).

Memory efficiency

One thing to keep in mind is that Associations are rather memory-hungry. This is to be expected from the structures which can't take advantage of compact memory layout of e.g. arrays. Here is the benchmark for our example:

ByteCount[store]

(* 11167568 *)

ByteCount[Range[100000]]*2

(* 1600288 *)

The difference can be larger for more complex data structures, but it also can be smaller if many keys point to same expressions, since ByteCount does not account for memory-sharing.

2. Advantages over DownValues

The main advantages are:

  • Immutability
  • Possibility to access deeper levels inside Associations transparently by structural commands such as Map.
  • Deep integration into the language - many old commands work on Associations, and there are many new functions (GroupBy, JoinAcross, etc), which make it easy to accomplish complex tasks.

Immutability

Of these, I value immutability the most. What this means is that you can safely copy an Association, or pass it anywhere, and it becomes completely decoupled from the original one. It has no state. So, you don't have to worry about deep-copying vs shallow-copying, and all that.

This is very important since most other Mathematica's data structures are immutable (notably Lists, but also others such as general expressions, sparse arrays, etc). It takes a single mutable structure to break immutability, if it is present as a part in a larger structure. This is a really big deal, because it allows one in many cases to solve problems which otherwise would require manual resource management.

In particular, should you use symbols / their DownValues, you'd have to generate new symbols and then manage them - monitor when they are no longer needed and release them. This is a pain in the neck. With Associations, you don't have to do this - they are automatically garbage-collected by Mathematica, once no longer referenced. What's more, also other expressions inside those Associations are then garbage-collected, if those are also no longer referenced.

Accessing deeper levels in expressions

This is very valuable, since it allows one to save a lot of coding effort and keep thinking on a higher level of abstraction. Here is an example:

stocks = <|1 -> <|"AAPL" -> {<|"company" -> "AAPL", 
   "date" -> {2014, 1, 2}, 
   "open" -> 78.47009708386817`|>, <|"company" -> "AAPL", 
   "date" -> {2014, 1, 3}, "open" -> 78.07775518503458`|>}, 
"GE" -> {<|"company" -> "GE", "date" -> {2014, 1, 2}, 
   "open" -> 27.393978181818177`|>, <|"company" -> "GE", 
   "date" -> {2014, 1, 3}, "open" -> 27.05933042212518`|>}|>|>

The first key here is the month, and is the only one since all prices happen to be for the January. Now, we can, for example, Map some function on various levels:

Map[f, stocks]

(*
  <|1 -> 
  f[<|"AAPL" -> {<|"company" -> "AAPL", "date" -> {2014, 1, 2}, 
   "open" -> 78.4701|>, <|"company" -> "AAPL", 
   "date" -> {2014, 1, 3}, "open" -> 78.0778|>}, 
"GE" -> {<|"company" -> "GE", "date" -> {2014, 1, 2}, 
   "open" -> 27.394|>, <|"company" -> "GE", 
   "date" -> {2014, 1, 3}, "open" -> 27.0593|>}|>]|>
*)

Map[f, stocks, {2}]

(*
  <|1 -> <|"AAPL" -> 
f[{<|"company" -> "AAPL", "date" -> {2014, 1, 2}, 
   "open" -> 78.4701|>, <|"company" -> "AAPL", 
   "date" -> {2014, 1, 3}, "open" -> 78.0778|>}], 
"GE" -> f[{<|"company" -> "GE", "date" -> {2014, 1, 2}, 
   "open" -> 27.394|>, <|"company" -> "GE", 
   "date" -> {2014, 1, 3}, "open" -> 27.0593|>}]|>|>
*)

Map[f, stocks, {3}]

(*
  <|1 -> <|"AAPL" -> {f[<|"company" -> "AAPL", 
   "date" -> {2014, 1, 2}, "open" -> 78.4701|>], 
 f[<|"company" -> "AAPL", "date" -> {2014, 1, 3}, 
   "open" -> 78.0778|>]}, 
"GE" -> {f[<|"company" -> "GE", "date" -> {2014, 1, 2}, 
   "open" -> 27.394|>], 
 f[<|"company" -> "GE", "date" -> {2014, 1, 3}, 
   "open" -> 27.0593|>]}|>|>
*)

This last example makes it easy to see how would we, for example, round all prices to integers:

Map[MapAt[Round, #, {Key["open"]}] &, #, {3}] & @ stocks

(*
 <|1 -> <|"AAPL" -> {<|"company" -> "AAPL", 
  "date" -> {2014, 1, 2}, "open" -> 78|>, <|"company" -> "AAPL", 
  "date" -> {2014, 1, 3}, "open" -> 78|>}, 
"GE" -> {<|"company" -> "GE", "date" -> {2014, 1, 2}, 
  "open" -> 27|>, <|"company" -> "GE", "date" -> {2014, 1, 3}, 
  "open" -> 27|>}|>|>
*)

As noted by Taliesin Beynon in comments, there are more elegant ways to do this, using the new operator forms for Map and MapAt:

Map[MapAt[Round, "open"], #, {3}] & @ stocks

and

MapAt[Round, {All, All, All, "open"}] @ stocks

which illustrate my point of transparent access to deeper layers even more.

So, what does this buy us? A lot. We do use here immutability heavily, because it is only due to immutability that functions such as Map can operate on Associations efficiently, producing new ones, completely decoupled from the old ones. In fact, as long as manipulations are structural ones on "higher levels", this is very efficient, because the actual expressions (leaves at the bottom) might be untouched.

But there is more here. With just one command, we can transparently inject stuff on any level in an Association. This is very powerful capability. Just think of what would be involved in doing so with the traditional hash tables (nested DownValues, for instance). You will have to generate several symbols (often many of them), then manually traverse the nested structure (non-transparent, much more code, and slower), and also do manual resource management for these symbols.

Integration with the language

I may expand later on this, but many examples have been given in other answers. Basically, lots of functions (Part, Map, MapAt, MapIndexed, Delete, etc) do work on Associations, including nested ones. Besides, you can use multi-arg Part on nested associations having inside them other expressions (Lists,Associations, etc). In addition to this, a host of new functions have been introduced to work specifically on Associations, making it easy to do complex data transformations (GroupBy, Merge, JoinAcross, Catenate, KeyMap, etc). The language support for Associations is approaching that for List-s. Together, they make the core of the data-processing primitives.

So, the addition of Associations to the language made it strictly more powerful, primarily because of two things: level of language integration and immutability.

3. Dataset and large data.

Right now, Dataset is not suitable for working with really large data (the one that does not fit into memory). Since Associations are rather memory-hungry, it puts additional constraints on the sizes of data sets amenable to the current version of Dataset.

However, work is underway to address this problem. Currently, the best way to view Dataset is IMO as a query language specification with an in-memory implementation. In the future, it can also have other / different implementations / backends.

Also, in practice, a lot of interesting data sets are still small enough that can be effectively worked with using Dataset. This is particularly true for various "business-type" data, which tend to often not be very huge. The huge data often involves large numerical data sets, and I am sure this case will be addressed by the Dataset framework in the near future.

4. Examples of improvements

See section III.

II. Associations as objects / structs

Associations can be used as structs. To illustrate some of the possibilities, I will use a simple object which has to store person's first and last name, and have get and set methods for them, and also have an additional method to return full name. I will consider three different ways to implement this, two of which will use Associations

1. Mutable struct implementation (one of the possibilities)

Here is the code:

ClearAll[makePersonInfoMutable];
makePersonInfoMutable[fname_, lname_] :=
  Module[{fn = fname, ln = lname, instance},
    SetAttributes[instance, HoldAll];
    instance @ getFirstName[] := fn;
    instance @ setFirstName[name_] := fn = name;
    instance @ getLastName[] := ln;
    instance @ setLastName[name_] := ln = name;
    instance @ getFullName[] := fn <> " " <> ln;
    instance];

Here is how one can use this:

pinfo = makePersonInfoMutable["Leonid", "Shifrin"]

(* instance$552472 *)

pinfo @ getFirstName[]

(* "Leonid" *)

pinfo @ setLastName["Brezhnev"]
pinfo @ getFullName[]

(*  "Leonid Brezhnev" *)

This method is Ok, but it has some short-comings: one needs to introduce several internal mutable variables, which must be manually managed. Also, the instance variable itself must be managed.

2. Using Associations - the immutable way

One can instead use Associations very simply, as follows:

pinfoIm = <|"firstName" -> "Leonid", "lastName" -> "Shifrin"|>

(* <|"firstName" -> "Leonid", "lastName" -> "Shifrin"|> *)

pinfoIm["firstName"]

(* "Leonid" *)

AppendTo[pinfoIm, "lastName" -> "Brezhnev"]

(* <|"firstName" -> "Leonid", "lastName" -> "Brezhnev"|> *)

This is fine and efficient, and no additional symbol /state management is needed here. However, this method also has its short-comings:

  • No natural way to define methods on such objects (have to be just functions, but then they will produce new objects)

  • What if I do want the changes made to the object to be reflected in other places where the object is used. In other words, what if I don't want to create an immutable copy, but want instead to share some state?

So, this method is fine as long as the problem can be completely addressed by immutable objects (no state).

3. Combining Associations and mutability

One can do this using the following method (of my own invention, so can't guarantee it will always work):

pinfoSM = 
  Module[{self},
    self = 
     <|
       "firstName" -> "Leonid",
       "lastName" -> "Shifrin",
       "setField" -> 
           Function[{field, value}, self[field] = value; self = self],
       "fullName" ->  
           Function[self@"firstName" <> " " <> self@"lastName"],
       "delete" -> Function[Remove[self]]
     |>
  ];

what happens here is that we capture the Module variable, and use it inside the Association. In this way, we inject some mutability into otherwise immutable structure. You can see that now, we can define "methods" - functions which work on this particular instance, and possibly mutate its state.

Here is an example of use:

pinfoSM["firstName"] 

(* "Leonid" *)

pinfoSM["setField"]["lastName", "Brezhnev"];
pinfoSM["fullName"][]

(* "Leonid Brezhnev" *)

Note that here we used an extra pair of brackets to perform the function call. If you don't like this syntax, you can instead of the line

"fullName" ->  Function[self@"firstName" <> " " <> self@"lastName"]

use

"fullName" :>  self@"firstName" <> " " <> self@"lastName"

and then call just pinfoSM["fullName"] (this is possible because Associations respect RuleDelayed for the key-value pairs, and don't then evaluate the r.h.s. (value) until it is extracted). In this way, the fields can be made to behave similar to Python's properties.

EDIT

As noted by saturasl in comments, the above version exhibits erroneous behavior when changed properties are accessed directly. In the last example, for instance, after the change we still get

pinfoSM["lastName"]

(* "Shifrin" *)

The reason is that while self has changed, the pinfoSM still stores the same field values for lastName and firstName.

One possible solution here is in the spirit of Python's properties: hide the actual fields, and introduce the accessors with the names which we previously used for the fields themselves:

pinfoSM = 
  Module[{self}, 
    self =
      <|
         "_firstName" -> "Leonid", 
         "_lastName" -> "Shifrin", 
         "setField" -> 
            Function[{field, value}, 
               self["_" <> field] = value; 
               self = self], 
         "fullName" -> 
            Function[self@"firstName" <> " " <> self@"lastName"], 
         "delete" -> Function[Remove[self]],
         "firstName" :> self@"_firstName",
         "lastName"  :> self@"_lastName"
      |>
  ];

Now the previous code will all work, and we also have after change:

pinfoSM["lastName"]

(* "Brezhnev" *)

As it should be. It is understood that the fields "_firstName" and "_lastName" are private and should not be accessed directly, but rather via the "accessor" fields "firstName" and "lastName". This provides a level of indirection needed to account for the changes in self correctly.

END EDIT

So, this version is stateful. Still, depending on the problem, it may have advantages. One is for cases where you want all instances of the object to update if you make a change in one (in other words, you don't want an independent immutable copy). Another is that the "methods" here work specifically on a given instance. You do need to manage these objects (destroy them once they are no longer referenced), but here you only have one symbol which is stafeful. I find this construct to be a nice combination of mutable and immutable state.

III. Example: a toy hierarchical database

Here, I will illustrate the utility of both Associations and a new operator form of functional programming in Mathematica, by constructing a toy hierarchical database of stock data.

Sample data

We start with the data:

data = 
 Composition[
   Map[Association],
   Flatten[#, 1] &,
   Map[
    Function[
      company,
      Composition[
        Map[
          Composition[
            Prepend["company" -> company],
            MapThread[Rule, {{"date", "open"}, #}] &
          ]
        ],
        If[MatchQ[#, _Missing], {}, #] &,
        FinancialData[#, "Open", {{2013, 12, 25}, {2014, 1, 05}}] &
      ] @ company
    ]]]@{"GOOG", "AAPL", "MSFT", "GE"}

Here is the result:

(*
{<|"company" -> "AAPL", "date" -> {2013, 12, 26}, "open" -> 80.2231|>,    
 <|"company" -> "AAPL", "date" -> {2013, 12, 27}, "open" -> 79.6268|>, 
 <|"company" -> "AAPL", "date" -> {2013, 12, 30}, "open" -> 78.7252|>, 
 <|"company" -> "AAPL", "date" -> {2013, 12, 31}, "open" -> 78.2626|>, 
 <|"company" -> "AAPL", "date" -> {2014, 1, 2}, "open" -> 78.4701|>,
 <|"company" -> "AAPL", "date" -> {2014, 1, 3}, "open" -> 78.0778|>, 
 <|"company" -> "MSFT", "date" -> {2013, 12, 26}, "open" -> 36.6635|>,
 <|"company" -> "MSFT", "date" -> {2013, 12, 27}, "open" -> 37.0358|>, 
 <|"company" -> "MSFT", "date" -> {2013, 12, 30}, "open" -> 36.681|>, 
 <|"company" -> "MSFT", "date" -> {2013, 12, 31}, "open" -> 36.8601|>, 
 <|"company" -> "MSFT", "date" -> {2014, 1, 2}, "open" -> 36.8173|>, 
 <|"company" -> "MSFT", "date" -> {2014, 1, 3},  "open" -> 36.6658|>, 
 <|"company" -> "GE", "date" -> {2013, 12, 26}, "open" -> 27.2125|>, 
 <|"company" -> "GE", "date" -> {2013, 12, 27}, "open" -> 27.3698|>, 
 <|"company" -> "GE", "date" -> {2013, 12, 30}, "open" -> 27.3708|>, 
 <|"company" -> "GE", "date" -> {2013, 12, 31}, "open" -> 27.4322|>, 
 <|"company" -> "GE", "date" -> {2014, 1, 2}, "open" -> 27.394|>, 
 <|"company" -> "GE", "date" -> {2014, 1, 3}, "open" -> 27.0593|>
}
*)

Note that the code to construct this result heavily uses the operator forms for various functions (here Map and Prepend), and also Composition is frequently used. This has many advantages, including clarity and maintainability (but there are others too).

Generating the transform to nested data store

The following functions will generate a transform, that would transform the above data into a nested data store, built out of Lists and Associations

ClearAll[keyWrap];
keyWrap[key_Integer] := key;
keyWrap[key_] := Key[key];

ClearAll[pushUp];
(* Operator form *)
pushUp[key_String] := pushUp[{key}];
pushUp[{keyPath__}] :=
  With[{keys = Sequence @@ Map[keyWrap, {keyPath}]},
    GroupBy[Part[#, keys] &]
  ];
(* Actual form *)
pushUp[assocs : {__Association}, keys__] :=
  pushUp[keys][assocs];

(* Constructs a transform to construct nested dataset from flat table *)
ClearAll[pushUpNested];
pushUpNested[{}] := Identity;
pushUpNested[specs : {_List ..}] :=
  Composition[
    Map[pushUpNested[Rest[specs]]],
    pushUp@First[specs]
  ];

The pushUp function is basically GroupBy, wrapped in a different syntax (which makes it easier to specify multi-part paths). I have simplified from the one I have used for my purposes - the original version was also deleting the key on which we group, from the grouped associations.

In our case, we need to supply the specification to get the nested data set. Here is an example, where we group by the year first, then by the month, and then by the company name:

transform = pushUpNested[{{"date", 1}, {"date", 2}, {"company"}}]

(* 
  Map[Map[Map[Identity]@*GroupBy[#1[[Sequence[Key["company"]]]] &]]@*
   GroupBy[#1[[Sequence[Key["date"], 2]]] &]]@*
     GroupBy[#1[[Sequence[Key["date"], 1]]] &]
*)

Note that this operator approach has a number of advantages. It is declarative, and at the end we generate a complex transformation function which can be analyzed and argued about.

Now, here is how it can be used:

 nested = transform @ data

(*
 <|2013 -> <|12 -> <|"AAPL" -> {<|"company" -> "AAPL", 
    "date" -> {2013, 12, 26}, 
    "open" -> 80.2231|>, <|"company" -> "AAPL", 
    "date" -> {2013, 12, 27}, 
    "open" -> 79.6268|>, <|"company" -> "AAPL", 
    "date" -> {2013, 12, 30}, 
    "open" -> 78.7252|>, <|"company" -> "AAPL", 
    "date" -> {2013, 12, 31}, "open" -> 78.2626|>}, 
 "MSFT" -> {<|"company" -> "MSFT", "date" -> {2013, 12, 26}, 
    "open" -> 36.6635|>, <|"company" -> "MSFT", 
    "date" -> {2013, 12, 27}, 
    "open" -> 37.0358|>, <|"company" -> "MSFT", 
    "date" -> {2013, 12, 30}, 
    "open" -> 36.681|>, <|"company" -> "MSFT", 
    "date" -> {2013, 12, 31}, "open" -> 36.8601|>}, 
 "GE" -> {<|"company" -> "GE", "date" -> {2013, 12, 26}, 
    "open" -> 27.2125|>, <|"company" -> "GE", 
    "date" -> {2013, 12, 27}, 
    "open" -> 27.3698|>, <|"company" -> "GE", 
    "date" -> {2013, 12, 30}, 
    "open" -> 27.3708|>, <|"company" -> "GE", 
    "date" -> {2013, 12, 31}, "open" -> 27.4322|>}|>|>, 
 2014 -> <|1 -> <|"AAPL" -> {<|"company" -> "AAPL", 
    "date" -> {2014, 1, 2}, 
    "open" -> 78.4701|>, <|"company" -> "AAPL", 
    "date" -> {2014, 1, 3}, "open" -> 78.0778|>}, 
 "MSFT" -> {<|"company" -> "MSFT", "date" -> {2014, 1, 2}, 
    "open" -> 36.8173|>, <|"company" -> "MSFT", 
    "date" -> {2014, 1, 3}, "open" -> 36.6658|>}, 
 "GE" -> {<|"company" -> "GE", "date" -> {2014, 1, 2}, 
    "open" -> 27.394|>, <|"company" -> "GE", 
    "date" -> {2014, 1, 3}, "open" -> 27.0593|>}|>|>|>
*)

You can see the immediate advantage of this - it is very easy to construct any other nested structure we want, with different grouping at different levels.

Querying the nested structure

Along the same lines, here is how we can construct queries to run against this structure. For simplicity, I will only consider queries which specify explicitly the keys we want to keep at each level, as a list, or All if we want to keep all entries at that level.

Here is the query generator:

(* Modified Part, to stop at missing elements *)
ClearAll[part];
part[m_Missing, spec__] := m;
part[expr_, spec__] := Part[expr, spec];


(* Builds a query to run on nested dataset *)
ClearAll[query];
query[{}] := Identity;
query[spec : {(_List | All) ..}] :=
  Composition[
    Map[query[Rest[spec]]],
    With[{curr = First@spec},
      If[curr === All,
        # &,
        part[#, Key /@ curr] &
      ]
    ]
  ];

It also heavily uses the operator form, constructing a rather complex function to query the nested data set, from a simple spec.

Let us now try constructing some queries:

q = query[{{2013}, All, {"AAPL", "MSFT"}}]

(*
  Map[Map[Map[
    Identity]@*(part[#1, 
    Key /@ {"AAPL", "MSFT"}] &)]@*(#1 &)]@*(part[#1, Key /@ {2013}] &)
*)

Now, we can run it:

q @ nested

(*

 <|2013 -> <|12 -> <|"AAPL" -> {<|"company" -> "AAPL", 
    "date" -> {2013, 12, 26}, 
    "open" -> 80.2231|>, <|"company" -> "AAPL", 
    "date" -> {2013, 12, 27}, 
    "open" -> 79.6268|>, <|"company" -> "AAPL", 
    "date" -> {2013, 12, 30}, 
    "open" -> 78.7252|>, <|"company" -> "AAPL", 
    "date" -> {2013, 12, 31}, "open" -> 78.2626|>}, 
 "MSFT" -> {<|"company" -> "MSFT", "date" -> {2013, 12, 26}, 
    "open" -> 36.6635|>, <|"company" -> "MSFT", 
    "date" -> {2013, 12, 27}, 
    "open" -> 37.0358|>, <|"company" -> "MSFT", 
    "date" -> {2013, 12, 30}, 
    "open" -> 36.681|>, <|"company" -> "MSFT", 
    "date" -> {2013, 12, 31}, "open" -> 36.8601|>}|>|>|>
*)

Now, let's look back and see what we've done: in just a few lines of code, we have constructed a fully functional small hierarchical database (actually, a generator of such databases), based on nested Association-s, and then a query generator which allows one to construct and run simple queries against that database.

Now, this has been a toy dataset. Is this construction practical for larger set of data (like, tens of thousands of records and more)? Yes! I have originally written this type of code for a problem involving data sets with hundreds of thousands of records, and the queries run extremely fast, as long as most of the data is categorical in nature (can be reduced to a finite small set of distinct keys on each level).

Now, think about what would be involved in implementation of this type of thing without Associations. My bet is that this wouldn't even be possible, or at the very least would've been much more work. And because usual hash tables are not immutable, the whole elegant operator approach of construction of queries / transforms as function composition in a declarative way wouldn't even come to mind (at least as far as I am concerned).

A note on Dataset

Incidentally, Dataset uses a more complex version of the same set of ideas. I can now also partly answer the questions many people asked about what does the Dataset bring which isn't readily available without it. The answer is that, by generating queries in a way conceptually similar to the above query function (although, of course, in a much more general way), it brings a new level of automation to the query construction, particularly for nested hierarchical data sets. In a way, query is a toy example of a compiler from the simple declarative query specification to an actual query that can be run. The more complex is the query, the more this layer will buy you.

Conclusions

So, at least for me, the presence of Associations in the language (as well as the operator form of many functions) not just simplifies many tasks, but actually opens new ways of thinking and programming.

POSTED BY: Leonid Shifrin

enter image description here - Congratulations! This post is now a Staff Pick as distinguished on your profile! Thank you, keep it coming!

POSTED BY: EDITORIAL BOARD
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract