Message Boards Message Boards

How-To: Equivalent of R's dplyr::summarize in Wolfram Language

GROUPS:

I have been doing a lot of work with databases recently, some in Mathematica and some in R using the latter's very nice dplyr package.. Since I had trouble with the issue, I thought it might be worthwhile to share how to perform the equivalent of a "summarize" operation in R. The idea is to group data and to then produce a new dataset that computes, for each resulting group, various composite statistics associated with that group. It may be that there are simpler ways of reaching the same result, in which case you can regard this as a way of stimulating discussion on the matter. Also, I am fully aware that one can use the lovely RLink package to run R code from within Mathematica, including the dplyr package. And, indeed, sometimes this is very useful. Still, there are occasions in which one does not want the overhead of coordinating with R and wants a pure Mathematica solution. Hence this post.

I am linking to a Wolfram Cloud notebook that should provide the necessary ideas and code. A copy of the content below.


The equivalent of dplyr's "summarize" in Mathematica

The R language features a package called "dplyr" that is widely used for analyzing data. One of the key functions used in dplyr is called summarize. The idea is to take some data, group it according to some common value and then find some summary statistics on each grouping of the data. So, the question is, how do we do this in Mathematica? Since at least I did not find it obvious at first, I thought this little tutorial would be helpful. If there are better approaches, I am hoping this stimulates discussion.

Let's build a little dataset. We'll construct it as an association of associations, meaning it will display as something with column names and row names

SeedRandom[100];
ds = Dataset@AssociationThread[CharacterRange["A","E"],
Map[AssociationThread[CharacterRange["a","g"],#]&,RandomInteger[{1,4},{5,7}]]]

enter image description here

Now let's group the data by its b value.

dg = Query[GroupBy[#b&]]@ds

enter image description here

Notice that we now have an association of association of associations. We could get an association of a list of associations if we first applied Values to the data before grouping it.

dv = Query[Values/*GroupBy[#b&]]@ds

enter image description here

Suppose that we want the mean value of d for all of the groupings and the maximum value of e and we want the new columns to have the labels "meand" and "maxe" . In R, we would write something like "ds %>% group_by(b) %>% summarize(meand= mean(d), maxe=max(e))". How do we do it in Mathematica? The key is to look at the structure of the values that have been produced by GroupBy. Since dg is an association of association of associations, here's what we need to do.

Let's start by writing a function h that takes a list of associations (r) and produces a new association in which the key "meand" takes on the value of the mean of a "Through" of the list of associations (which can act as functions) operating on "d" and in which the key "maxe" takes on the value of the max of a similar "Through" of the list of associations operating on "e".

h=r\[Function]Association["meand"->Mean@Through[r["d"]],"maxe"->Max@Through[r["e"]]]

If each row of our data is an Association, as is the case for dg, we need to take the Values of that association for each row and then apply the function h to the result.

Query[All,Values/*h]@dg

enter image description here

If each row of our data is a list of Associations, as is the case for dv, then we can just apply h directly to each row.

Query[All,h]@dv

enter image description here

And thats basically it. We can now treat the resulting Dataset as something we can in turn manipulate.

Query[All,Join[#,Association["v1"->Min[#meand,#maxe],"v2"->#meand+#maxe]]&]@Query[All,h]@dv

enter image description here

Or if we were ambitious and wanted to do it in one step:

Query[All,h/*(Join[#,Association["v1"->Min[#meand,#maxe],"v2"->#meand+#maxe]]&)]@dv

enter image description here

POSTED BY: Seth Chandler
Answer
1 year ago

Seth, this is great, thanks for sharing. For such a nice content I felt it is worth mirroring the the notebook right in the post and went ahead with it. I am glad you are taking advantage of the Dataset!

POSTED BY: Vitaliy Kaurov
Answer
1 year ago

Very nice, Seth, thanks for taking the time! Do you think it would be hard to reproduce all functionality of dplyr in Mathematica? Which parts of dplyr do you think are the best to have natively?

POSTED BY: Sam Carrettie
Answer
1 year ago

I am pretty confident everything in dplyr can be done in Mathematica. The question is coming up for an architecture that is (a) Mathematica-like, (b) simple and (c ) expressive for doing so. As I see it, dplyr is mostly two things: (a) use of a convenient forward chaining operator %>% taken from the magrittr package and (b) some very common database commands, many of which are already easy to do using Mathematica datasets. Magrittr style chaining, which is essentially a form of postfix notation designed to enhance readability, may well be possible using combinations of /* and // and the new operator notation forms available for commands such as Select and GroupBy. As far as the common commands go, everything one needs is already in Mathematica, it is just a matter of writing some wrappers (to implement mutate, for example) and creating some sort of equivalency table such as dplyr filter = Mathematica Select, etc.. I actually have to get on an airplane shortly, but I will give this some additional thought.

POSTED BY: Seth Chandler
Answer
1 year ago

Just to give another example, I saw this R code under the first search result for "dplyr":

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

Here is the output:

enter image description here

Here is equivalent Mathematica 10.1 code and a similar chart.

flights[GroupBy@
     "tailnum", <|"count" -> Length@#, 
      "dist" -> N@Mean[#[[All, "distance"]]], 
      "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] // 
   Select[#count > 20 && #dist < 2000 &] // 
  Map[{#dist, #delay, #count} &] // 
 BubbleChart[#, ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]},
    BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2, 
   GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] &

enter image description here

POSTED BY: Michael Hale
Answer
1 year ago

Michael, this is pretty neat! Is there a simple way to get the data for your flights (I guess you built a Dataset). And what do you think they used for the blue curve?

POSTED BY: Sam Carrettie
Answer
1 year ago

Hi Sam, Yes, I exported the sample dataset from R with write.csv() and then I used SemanticImport in Mathematica to convert it to a dataset. For a dataset of that size SemanticImport takes a minute. Maybe you could speed it up by manually specifying the column types, but a minute is fine for this. The blue curve and shading are a predicted mean and confidence interval. It's produced by the geom_smooth() function in the ggplot R package. According to their documentation, they use a generalized additive model for data sets larger than 2000 points, confidence bands of 95%, and some heuristics/meta-algorithms to select smoothness that I didn't dig into. I think a linear model of degree 10 is fine for this illustration.

flights = SemanticImport["E:\\flights.csv"];

summary = 
  flights[GroupBy@
     "tailnum", <|"count" -> Length@#, 
      "dist" -> N@Mean[#[[All, "distance"]]], 
      "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] // 
   Select[#count > 20 && #dist < 2000 && NumberQ@#delay &];

lm = LinearModelFit[{#dist, #delay} & /@ Values@Normal@summary, 
   Table[x^n, {n, 10}], x];
bands = lm["MeanPredictionBands"];

bc = summary // Map[{#dist, #delay, #count} &] // 
   BubbleChart[#, 
     ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]}, 
     BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2, 
     GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] &;

Show[bc, Plot[bands, {x, 170, 2000}, PlotStyle -> None, 
  FillingStyle -> {Blue, Opacity@.75}, Filling -> {1 -> {2}}], 
 Plot[lm[x], {x, 170, 2000}]]

enter image description here

POSTED BY: Michael Hale
Answer
1 year ago

FWIW, I will be presenting a talk on Thursday at the Wolfram Technology Conference that presents an early and crude effort to emulate good chunks of dplyr using Mathematica operators. The talk is on the Affordable Care Act but I found that, in order to analyze large chunks of data involving the law, I needed to learn a fair amount about database operations. The idea is to provide a syntax that is somewhat dplyr-like (i.e. emphasis on postfix operators) but that also uses ideas such as Mathematica pure functions. I am not yet posting it because I am VERY AWARE that it has lots of issues and would like some feedback first. In the not-too-long-run, however, I think it would be very worthwhile for the Wolfram Documentation to contain more examples of how to perform common database operations (such as GroupBy and summarize) for various forms of Mathematica Datasets (list of lists, list of associations, association of lists, and association of associations). And I think some higher-level functions and operators would be useful too. Hopefully, my efforts will catalyze a dialog on that point.

POSTED BY: Seth Chandler
Answer
1 year ago

enter image description here - another post of yours has been selected for the Staff Picks group, congratulations !

We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!

POSTED BY: Moderation Team
Answer
1 year ago

Group Abstract Group Abstract