# How-To: Equivalent of R's dplyr::summarize in Wolfram Language

GROUPS:

I have been doing a lot of work with databases recently, some in Mathematica and some in R using the latter's very nice dplyr package.. Since I had trouble with the issue, I thought it might be worthwhile to share how to perform the equivalent of a "summarize" operation in R. The idea is to group data and to then produce a new dataset that computes, for each resulting group, various composite statistics associated with that group. It may be that there are simpler ways of reaching the same result, in which case you can regard this as a way of stimulating discussion on the matter. Also, I am fully aware that one can use the lovely RLink package to run R code from within Mathematica, including the dplyr package. And, indeed, sometimes this is very useful. Still, there are occasions in which one does not want the overhead of coordinating with R and wants a pure Mathematica solution. Hence this post.

I am linking to a Wolfram Cloud notebook that should provide the necessary ideas and code. A copy of the content below.

## The equivalent of dplyr's "summarize" in Mathematica

The R language features a package called "dplyr" that is widely used for analyzing data. One of the key functions used in dplyr is called summarize. The idea is to take some data, group it according to some common value and then find some summary statistics on each grouping of the data. So, the question is, how do we do this in Mathematica? Since at least I did not find it obvious at first, I thought this little tutorial would be helpful. If there are better approaches, I am hoping this stimulates discussion.

Let's build a little dataset. We'll construct it as an association of associations, meaning it will display as something with column names and row names

SeedRandom[100];


Now let's group the data by its b value.

dg = Query[GroupBy[#b&]]@ds


Notice that we now have an association of association of associations. We could get an association of a list of associations if we first applied Values to the data before grouping it.

dv = Query[Values/*GroupBy[#b&]]@ds


Suppose that we want the mean value of d for all of the groupings and the maximum value of e and we want the new columns to have the labels "meand" and "maxe" . In R, we would write something like "ds %>% group_by(b) %>% summarize(meand= mean(d), maxe=max(e))". How do we do it in Mathematica? The key is to look at the structure of the values that have been produced by GroupBy. Since dg is an association of association of associations, here's what we need to do.

Let's start by writing a function h that takes a list of associations (r) and produces a new association in which the key "meand" takes on the value of the mean of a "Through" of the list of associations (which can act as functions) operating on "d" and in which the key "maxe" takes on the value of the max of a similar "Through" of the list of associations operating on "e".

h=r\[Function]Association["meand"->Mean@Through[r["d"]],"maxe"->Max@Through[r["e"]]]


If each row of our data is an Association, as is the case for dg, we need to take the Values of that association for each row and then apply the function h to the result.

Query[All,Values/*h]@dg


If each row of our data is a list of Associations, as is the case for dv, then we can just apply h directly to each row.

Query[All,h]@dv


And thats basically it. We can now treat the resulting Dataset as something we can in turn manipulate.

Query[All,Join[#,Association["v1"->Min[#meand,#maxe],"v2"->#meand+#maxe]]&]@Query[All,h]@dv


Or if we were ambitious and wanted to do it in one step:

Query[All,h/*(Join[#,Association["v1"->Min[#meand,#maxe],"v2"->#meand+#maxe]]&)]@dv


2 years ago
8 Replies
 Vitaliy Kaurov 1 Vote Seth, this is great, thanks for sharing. For such a nice content I felt it is worth mirroring the the notebook right in the post and went ahead with it. I am glad you are taking advantage of the Dataset!
2 years ago
 Sam Carrettie 4 Votes Very nice, Seth, thanks for taking the time! Do you think it would be hard to reproduce all functionality of dplyr in Mathematica? Which parts of dplyr do you think are the best to have natively?
2 years ago
 Seth Chandler 3 Votes I am pretty confident everything in dplyr can be done in Mathematica. The question is coming up for an architecture that is (a) Mathematica-like, (b) simple and (c ) expressive for doing so. As I see it, dplyr is mostly two things: (a) use of a convenient forward chaining operator %>% taken from the magrittr package and (b) some very common database commands, many of which are already easy to do using Mathematica datasets. Magrittr style chaining, which is essentially a form of postfix notation designed to enhance readability, may well be possible using combinations of /* and // and the new operator notation forms available for commands such as Select and GroupBy. As far as the common commands go, everything one needs is already in Mathematica, it is just a matter of writing some wrappers (to implement mutate, for example) and creating some sort of equivalency table such as dplyr filter = Mathematica Select, etc.. I actually have to get on an airplane shortly, but I will give this some additional thought.
2 years ago
 Michael Hale 3 Votes Just to give another example, I saw this R code under the first search result for "dplyr": by_tailnum <- group_by(flights, tailnum) delay <- summarise(by_tailnum, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) delay <- filter(delay, count > 20, dist < 2000) # Interestingly, the average delay is only slightly related to the # average distance flown by a plane. ggplot(delay, aes(dist, delay)) + geom_point(aes(size = count), alpha = 1/2) + geom_smooth() + scale_size_area() Here is the output:Here is equivalent Mathematica 10.1 code and a similar chart. flights[GroupBy@ "tailnum", <|"count" -> Length@#, "dist" -> N@Mean[#[[All, "distance"]]], "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] // Select[#count > 20 && #dist < 2000 &] // Map[{#dist, #delay, #count} &] // BubbleChart[#, ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]}, BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2, GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] & 
2 years ago
 Michael, this is pretty neat! Is there a simple way to get the data for your flights (I guess you built a Dataset). And what do you think they used for the blue curve?
 Michael Hale 1 Vote Hi Sam, Yes, I exported the sample dataset from R with write.csv() and then I used SemanticImport in Mathematica to convert it to a dataset. For a dataset of that size SemanticImport takes a minute. Maybe you could speed it up by manually specifying the column types, but a minute is fine for this. The blue curve and shading are a predicted mean and confidence interval. It's produced by the geom_smooth() function in the ggplot R package. According to their documentation, they use a generalized additive model for data sets larger than 2000 points, confidence bands of 95%, and some heuristics/meta-algorithms to select smoothness that I didn't dig into. I think a linear model of degree 10 is fine for this illustration. flights = SemanticImport["E:\\flights.csv"]; summary = flights[GroupBy@ "tailnum", <|"count" -> Length@#, "dist" -> N@Mean[#[[All, "distance"]]], "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] // Select[#count > 20 && #dist < 2000 && NumberQ@#delay &]; lm = LinearModelFit[{#dist, #delay} & /@ Values@Normal@summary, Table[x^n, {n, 10}], x]; bands = lm["MeanPredictionBands"]; bc = summary // Map[{#dist, #delay, #count} &] // BubbleChart[#, ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]}, BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2, GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] &; Show[bc, Plot[bands, {x, 170, 2000}, PlotStyle -> None, FillingStyle -> {Blue, Opacity@.75}, Filling -> {1 -> {2}}], Plot[lm[x], {x, 170, 2000}]]