How-To: Equivalent of R's dplyr::summarize in Wolfram Language

POSTED BY: Seth Chandler
9 Replies

Congratulations! Another post of yours has been selected for the Staff Picks group.

We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!

POSTED BY: EDITORIAL BOARD

FWIW, I will be giving a talk on Thursday at the Wolfram Technology Conference that presents an early and crude effort to emulate good chunks of dplyr using Mathematica operators. The talk is on the Affordable Care Act, but I found that, in order to analyze large chunks of data involving the law, I needed to learn a fair amount about database operations. The idea is to provide a syntax that is somewhat dplyr-like (i.e., with an emphasis on postfix operators) but that also uses ideas such as Mathematica pure functions. I am not yet posting it because I am VERY AWARE that it has lots of issues and would like some feedback first. In the not-too-distant future, however, I think it would be very worthwhile for the Wolfram Documentation to contain more examples of how to perform common database operations (such as GroupBy and summarize) on the various forms of Mathematica datasets (list of lists, list of associations, association of lists, and association of associations). And I think some higher-level functions and operators would be useful too. Hopefully, my efforts will catalyze a dialog on that point.
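
To make the four data shapes concrete, here is a minimal sketch that stores the same small table in each form and takes the mean of one column. The variable names, row keys, and values are invented for illustration and do not come from the talk.

```wolfram
(* The same three-row table in four shapes (toy data, invented for this sketch) *)
listOfLists = {{"N1", 100}, {"N1", 300}, {"N2", 500}};

listOfAssocs = {
   <|"tailnum" -> "N1", "distance" -> 100|>,
   <|"tailnum" -> "N1", "distance" -> 300|>,
   <|"tailnum" -> "N2", "distance" -> 500|>};

assocOfLists = <|
   "tailnum" -> {"N1", "N1", "N2"},
   "distance" -> {100, 300, 500}|>;

assocOfAssocs = <|
   "r1" -> <|"tailnum" -> "N1", "distance" -> 100|>,
   "r2" -> <|"tailnum" -> "N1", "distance" -> 300|>,
   "r3" -> <|"tailnum" -> "N2", "distance" -> 500|>|>;

(* Mean of the distance column, extracted differently for each shape *)
m1 = Mean[listOfLists[[All, 2]]];                      (* column by position *)
m2 = Mean[listOfAssocs[[All, "distance"]]];            (* column by key *)
m3 = Mean[assocOfLists["distance"]];                   (* column stored whole *)
m4 = Mean[Values[assocOfAssocs][[All, "distance"]]];   (* drop row keys first *)
```

All four extractions reach the same column; only the access pattern changes, which is why a set of shape-independent wrappers is attractive.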

POSTED BY: Seth Chandler
Posted 10 years ago

Just to give another example, I saw this R code under the first search result for "dplyr":

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

Here is the output:

[Figure: ggplot2 scatter plot of delay against dist, with point size scaled by count and a smoothed curve]

Here is equivalent Mathematica 10.1 code and a similar chart.

flights[GroupBy@
     "tailnum", <|"count" -> Length@#, 
      "dist" -> N@Mean[#[[All, "distance"]]], 
      "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] // 
   Select[#count > 20 && #dist < 2000 &] // 
  Map[{#dist, #delay, #count} &] // 
 BubbleChart[#, ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]},
    BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2, 
   GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] &

[Figure: the corresponding BubbleChart of delay against dist, with bubble size given by count]

POSTED BY: Michael Hale

Michael, this is pretty neat! Is there a simple way to get the data for your flights? (I guess you built a Dataset.) And what do you think they used for the blue curve?

POSTED BY: Sam Carrettie
Posted 10 years ago

Hi Sam. Yes, I exported the sample dataset from R with write.csv() and then used SemanticImport in Mathematica to convert it to a Dataset. For a dataset of that size, SemanticImport takes a minute. Maybe you could speed it up by manually specifying the column types, but a minute is fine for this. The blue curve and shading are a predicted mean and confidence interval, produced by the geom_smooth() function in the ggplot2 R package. According to its documentation, it uses a generalized additive model for data sets of 1,000 or more points, 95% confidence bands, and some heuristics/meta-algorithms to select smoothness that I didn't dig into. I think a degree-10 polynomial fitted with LinearModelFit is fine for this illustration.

flights = SemanticImport["E:\\flights.csv"];

summary = 
  flights[GroupBy@
     "tailnum", <|"count" -> Length@#, 
      "dist" -> N@Mean[#[[All, "distance"]]], 
      "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] // 
   Select[#count > 20 && #dist < 2000 && NumberQ@#delay &];

lm = LinearModelFit[{#dist, #delay} & /@ Values@Normal@summary, 
   Table[x^n, {n, 10}], x];
bands = lm["MeanPredictionBands"];

bc = summary // Map[{#dist, #delay, #count} &] // 
   BubbleChart[#, 
     ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]}, 
     BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2, 
     GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] &;

Show[bc, Plot[bands, {x, 170, 2000}, PlotStyle -> None, 
  FillingStyle -> {Blue, Opacity@.75}, Filling -> {1 -> {2}}], 
 Plot[lm[x], {x, 170, 2000}]]

[Figure: the bubble chart overlaid with the fitted curve and its mean prediction bands]

POSTED BY: Michael Hale

I am pretty confident everything in dplyr can be done in Mathematica. The question is coming up with an architecture for doing so that is (a) Mathematica-like, (b) simple, and (c) expressive. As I see it, dplyr is mostly two things: (a) use of a convenient forward chaining operator %>% taken from the magrittr package and (b) some very common database commands, many of which are already easy to do using Mathematica datasets. Magrittr-style chaining, which is essentially a form of postfix notation designed to enhance readability, may well be possible using combinations of /* and // and the new operator forms available for commands such as Select and GroupBy. As far as the common commands go, everything one needs is already in Mathematica; it is just a matter of writing some wrappers (to implement mutate, for example) and creating some sort of equivalency table such as dplyr filter = Mathematica Select, etc. I actually have to get on an airplane shortly, but I will give this some additional thought.
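
As a rough illustration of the wrapper idea, here is a minimal sketch on a list of associations. The mutate helper is hypothetical (it is not a built-in, and not the implementation from the talk), and the data and the "delayPerMile" column are invented. Select and SortBy stand in for dplyr's filter and arrange.

```wolfram
(* Toy rows, invented for this sketch *)
rows = {
   <|"tailnum" -> "A", "distance" -> 100, "arr_delay" -> 10|>,
   <|"tailnum" -> "A", "distance" -> 300, "arr_delay" -> 30|>,
   <|"tailnum" -> "B", "distance" -> 500, "arr_delay" -> -5|>};

(* Hypothetical helper: returns an operator that adds the column `name`
   to every row, computing its value by applying f to that row *)
mutate[name_String, f_] := Map[Append[#, name -> f[#]] &];

(* filter -> Select, mutate -> the wrapper above, arrange -> SortBy.
   Keys containing an underscore need the #["key"] form. *)
result = rows //
    Select[#distance > 200 &] //
    mutate["delayPerMile", #["arr_delay"]/#distance &] //
    SortBy[#delayPerMile &];
```

Because every step is an operator form, the chain reads top to bottom much like a %>% pipeline, with // playing the role of the pipe.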

POSTED BY: Seth Chandler

As it turns out, the Wolfram Language now has a convenient way of chaining: the right composition operator /*. I'm currently producing a L O N G work on how to extract information from lists of Associations, nested Associations and Dataset that makes significant use of the RightComposition construct.
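
A minimal sketch of that style, assuming a toy list of associations (the data and the names flightsToy, summarize, and pipeline are invented here and are not taken from the longer work):

```wolfram
(* Toy data, invented for this sketch *)
flightsToy = {
   <|"tailnum" -> "N1", "distance" -> 100, "arr_delay" -> 10|>,
   <|"tailnum" -> "N1", "distance" -> 300, "arr_delay" -> 30|>,
   <|"tailnum" -> "N2", "distance" -> 500, "arr_delay" -> -4|>,
   <|"tailnum" -> "N2", "distance" -> 700, "arr_delay" -> 0|>,
   <|"tailnum" -> "N2", "distance" -> 900, "arr_delay" -> 7|>};

(* Reduce one group (a list of rows) to a single summary row *)
summarize = <|
    "count" -> Length[#],
    "dist" -> Mean[#[[All, "distance"]]],
    "delay" -> Mean[#[[All, "arr_delay"]]]|> &;

(* /* composes left to right: group, summarize each group, then filter *)
pipeline = GroupBy[#tailnum &] /* Map[summarize] /* Select[#count > 2 &];

summary = pipeline[flightsToy];
```

Unlike //, which applies each step to the data immediately, /* builds the whole pipeline as a single function before any data is supplied, so the same pipeline can be reused on other tables with the same columns.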

POSTED BY: Seth Chandler

Very nice, Seth, thanks for taking the time! Do you think it would be hard to reproduce all functionality of dplyr in Mathematica? Which parts of dplyr do you think are the best to have natively?

POSTED BY: Sam Carrettie

Seth, this is great, thanks for sharing. For such nice content, I felt it was worth mirroring the notebook right in the post, and went ahead with it. I am glad you are taking advantage of the Dataset!

POSTED BY: Vitaliy Kaurov