WOLFRAM COMMUNITY

How-To: Equivalent of R's dplyr::summarize in Wolfram Language

Seth Chandler, University of Houston

Posted 10 years ago

POSTED BY: Seth Chandler

9 Replies

Seth Chandler, University of Houston

Posted 7 years ago

POSTED BY: Seth Chandler

EDITORIAL BOARD, WOLFRAM

Posted 9 years ago

- another post of yours has been selected for the Staff Picks group, congratulations! We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!

POSTED BY: EDITORIAL BOARD

Seth Chandler, University of Houston

Posted 10 years ago

FWIW, I will be presenting a talk on Thursday at the Wolfram Technology Conference that presents an early and crude effort to emulate good chunks of dplyr using Mathematica operators. The talk is on the Affordable Care Act, but I found that, in order to analyze large chunks of data involving the law, I needed to learn a fair amount about database operations. The idea is to provide a syntax that is somewhat dplyr-like (i.e., an emphasis on postfix operators) but that also uses ideas such as Mathematica pure functions. I am not yet posting it because I am VERY AWARE that it has lots of issues and would like some feedback first. In the not-too-long run, however, I think it would be very worthwhile for the Wolfram Documentation to contain more examples of how to perform common database operations (such as GroupBy and summarize) for the various forms a Mathematica Dataset can take (list of lists, list of associations, association of lists, and association of associations). And I think some higher-level functions and operators would be useful too. Hopefully, my efforts will catalyze a dialog on that point.
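
To make the four shapes concrete, here is a minimal sketch of a group-and-summarize on each of them. The toy table, its column names "g" and "x", the row keys, and the variable names are all hypothetical and chosen only for illustration; this is one way to do it, not a recommended or official idiom.

```wolfram
(* Hypothetical toy table with a group column "g" and a value column "x",
   stored in the four shapes listed above. All names are illustrative. *)
listOfLists = {{"a", 1}, {"a", 3}, {"b", 10}};
listOfAssocs = {<|"g" -> "a", "x" -> 1|>, <|"g" -> "a", "x" -> 3|>,
   <|"g" -> "b", "x" -> 10|>};
assocOfLists = <|"g" -> {"a", "a", "b"}, "x" -> {1, 3, 10}|>;
assocOfAssocs = <|"r1" -> <|"g" -> "a", "x" -> 1|>,
   "r2" -> <|"g" -> "a", "x" -> 3|>,
   "r3" -> <|"g" -> "b", "x" -> 10|>|>;

(* GroupBy[data, key -> value, reducer] groups by key, keeps value,
   and reduces each group with reducer (here: Mean). *)
s1 = GroupBy[listOfLists, First -> Last, Mean];
s2 = GroupBy[listOfAssocs, Key["g"] -> Key["x"], Mean];

(* Column-oriented data: zip the two columns back into rows first *)
s3 = GroupBy[Transpose[{assocOfLists["g"], assocOfLists["x"]}],
   First -> Last, Mean];

(* Keyed rows: drop the row keys, then treat as a list of associations *)
s4 = GroupBy[Values[assocOfAssocs], Key["g"] -> Key["x"], Mean];
```

Each of s1 through s4 should evaluate to the same association, <|"a" -> 2, "b" -> 10|>; only the step that gets the data into row form differs between the shapes.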

POSTED BY: Seth Chandler

Michael Hale

Posted 10 years ago

Hi Sam. Yes, I exported the sample dataset from R with write.csv() and then used SemanticImport in Mathematica to convert it to a Dataset. For a dataset of that size SemanticImport takes a minute. Maybe you could speed it up by manually specifying the column types, but a minute is fine for this. The blue curve and shading are a predicted mean and confidence interval, produced by the geom_smooth() function in the ggplot2 R package. According to their documentation, they use a generalized additive model for data sets larger than 2000 points, confidence bands of 95%, and some heuristics/meta-algorithms to select smoothness that I didn't dig into. I think a degree-10 polynomial fit (a linear model in the powers of x) is fine for this illustration.

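(* Import the CSV exported from R; SemanticImport returns a Dataset *)
flights = SemanticImport["E:\\flights.csv"];

(* Per-plane summary: flight count, mean distance, mean arrival delay.
   The NumberQ test keeps only planes whose mean delay is numeric. *)
summary = 
  flights[GroupBy@
     "tailnum", <|"count" -> Length@#, 
      "dist" -> N@Mean[#[[All, "distance"]]], 
      "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] // 
   Select[#count > 20 && #dist < 2000 && NumberQ@#delay &];

(* Fit delay against a degree-10 polynomial in distance *)
lm = LinearModelFit[{#dist, #delay} & /@ Values@Normal@summary, 
   Table[x^n, {n, 10}], x];
bands = lm["MeanPredictionBands"];

(* Bubble chart of {dist, delay} with bubble size given by count *)
bc = summary // Map[{#dist, #delay, #count} &] // 
   BubbleChart[#, 
     ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]}, 
     BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2, 
     GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] &;

(* Overlay the shaded prediction bands and the fitted curve *)
Show[bc, Plot[bands, {x, 170, 2000}, PlotStyle -> None, 
  FillingStyle -> {Blue, Opacity@.75}, Filling -> {1 -> {2}}], 
 Plot[lm[x], {x, 170, 2000}]]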

[Image: bubble chart of mean delay against mean distance, overlaid with the fitted curve and shaded prediction bands]

POSTED BY: Michael Hale

Sam Carrettie, Freelancer

Posted 10 years ago

Michael, this is pretty neat! Is there a simple way to get the data for your flights (I guess you built a Dataset)? And what do you think they used for the blue curve?

POSTED BY: Sam Carrettie

Michael Hale

Posted 10 years ago

Just to give another example, I saw this R code under the first search result for "dplyr":

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

Here is the output:

[Image: ggplot2 scatter plot of delay against dist, with point size scaled by count and a smoothed curve]

Here is equivalent Mathematica 10.1 code and a similar chart.

flights[GroupBy@
     "tailnum", <|"count" -> Length@#, 
      "dist" -> N@Mean[#[[All, "distance"]]], 
      "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] // 
   Select[#count > 20 && #dist < 2000 &] // 
  Map[{#dist, #delay, #count} &] // 
 BubbleChart[#, ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]},
    BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2, 
   GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] &

[Image: Mathematica bubble chart of delay against dist, with bubble size scaled by count]

POSTED BY: Michael Hale

Seth Chandler, University of Houston

Posted 10 years ago

I am pretty confident everything in dplyr can be done in Mathematica. The question is coming up with an architecture for doing so that is (a) Mathematica-like, (b) simple, and (c) expressive. As I see it, dplyr is mostly two things: (a) use of a convenient forward-chaining operator, %>%, taken from the magrittr package, and (b) some very common database commands, many of which are already easy to do using Mathematica datasets. Magrittr-style chaining, which is essentially a form of postfix notation designed to enhance readability, may well be possible using combinations of /* and // and the new operator forms available for commands such as Select and GroupBy. As far as the common commands go, everything one needs is already in Mathematica; it is just a matter of writing some wrappers (to implement mutate, for example) and creating some sort of equivalency table such as dplyr filter = Mathematica Select, etc. I actually have to get on an airplane shortly, but I will give this some additional thought.
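
As a rough sketch of the chaining idea described above, operator forms can be composed left to right with /* (RightComposition) and applied to the data in one step. The toy data, the column names "g", "x", and "y", and the dplyr correspondences in the comments are assumptions made for illustration, not an official equivalency table.

```wolfram
(* Hypothetical toy data: a list of associations standing in for a table *)
data = {<|"g" -> "a", "x" -> 1|>, <|"g" -> "a", "x" -> 3|>,
   <|"g" -> "b", "x" -> 10|>, <|"g" -> "b", "x" -> 20|>,
   <|"g" -> "c", "x" -> 5|>};

(* Informal dplyr analogue of the pipeline below:
   data %>% filter(x > 1) %>% mutate(y = 2 * x) %>%
     group_by(g) %>% summarise(n = n(), mean = mean(x)) *)
pipeline =
  Select[#x > 1 &] /*                    (* roughly filter *)
   Map[Append[#, "y" -> 2*#x] &] /*      (* roughly mutate *)
   GroupBy[#g &] /*                      (* roughly group_by *)
   Map[<|"n" -> Length[#],
      "mean" -> Mean[#[[All, "x"]]]|> &];  (* roughly summarise *)

result = pipeline[data]
```

Writing data // pipeline gives the same result as pipeline[data]. With this toy data, result should be <|"a" -> <|"n" -> 1, "mean" -> 3|>, "b" -> <|"n" -> 2, "mean" -> 15|>, "c" -> <|"n" -> 1, "mean" -> 5|>|>.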

POSTED BY: Seth Chandler

Sam Carrettie, Freelancer

Posted 10 years ago

Very nice, Seth, thanks for taking the time! Do you think it would be hard to reproduce all functionality of dplyr in Mathematica? Which parts of dplyr do you think are the best to have natively?

POSTED BY: Sam Carrettie

Vitaliy Kaurov, WOLFRAM Research

Posted 10 years ago

POSTED BY: Vitaliy Kaurov
