Hi Sam,
Yes, I exported the sample dataset from R with write.csv() and then used SemanticImport in Mathematica to turn it into a Dataset. For a dataset of that size SemanticImport takes about a minute. You could probably speed it up by specifying the column types manually, but a minute is fine for this.
The blue curve and shading are a predicted mean and its confidence interval, produced by the geom_smooth() function in the ggplot2 R package. According to its documentation, the default is a generalized additive model for data sets of 1,000 or more observations, with a 95% confidence band and some heuristics/meta-algorithms for selecting smoothness that I didn't dig into.
I think a degree-10 polynomial, fitted as a linear model with LinearModelFit, is fine for this illustration.
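(* Import the CSV that was exported from R *)
flights = SemanticImport["E:\\flights.csv"];

(* Per-plane summary: number of flights, mean distance, mean arrival delay *)
summary =
  flights[
    GroupBy["tailnum"],
    <|"count" -> Length[#],
      "dist" -> N@Mean[#[[All, "distance"]]],
      "delay" -> N@Mean[#[[All, "arr_delay"]]]|> &] //
   Select[#count > 20 && #dist < 2000 && NumberQ[#delay] &];

(* Degree-10 polynomial in x, fitted by least squares *)
lm = LinearModelFit[{#dist, #delay} & /@ Values@Normal@summary,
   Table[x^n, {n, 10}], x];

(* {lower, upper} confidence band for the predicted mean *)
bands = lm["MeanPredictionBands"];

(* Bubble chart of the planes, with bubble size given by flight count *)
bc = summary // Map[{#dist, #delay, #count} &] //
   BubbleChart[#,
     ChartBaseStyle -> {Black, Opacity@.5, EdgeForm[None]},
     BubbleSizes -> {0.005, 0.1}, AspectRatio -> 1/2,
     GridLines -> Automatic, FrameLabel -> {"Dist", "Delay"}] &;

(* Overlay the shaded band and the fitted curve on the bubble chart *)
Show[bc,
 Plot[bands, {x, 170, 2000}, PlotStyle -> None,
  FillingStyle -> {Blue, Opacity@.75}, Filling -> {1 -> {2}}],
 Plot[lm[x], {x, 170, 2000}]]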
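If you want to see what the shaded band is without loading the flights file, here is a minimal sketch on synthetic data. The data, the cubic basis, and the names data, fit, meanBands and singleBands are all made up for illustration and are not part of the analysis above. It sets the confidence level explicitly (0.95 is also the default in LinearModelFit, which matches the 95% that geom_smooth() uses) and compares the band for the predicted mean with the wider band for a single new observation.

```wolfram
(* Synthetic stand-in data: a straight line plus Gaussian noise *)
SeedRandom[1];
data = Table[
   {t, 2 + 0.5 t + RandomVariate[NormalDistribution[0, 1]]},
   {t, 0., 10., 0.25}];

(* Cubic polynomial fitted by least squares (illustrative choice of degree) *)
fit = LinearModelFit[data, {x, x^2, x^3}, x];

(* {lower, upper} band for the predicted mean at the stated confidence level *)
meanBands = fit["MeanPredictionBands", ConfidenceLevel -> 0.95];

(* {lower, upper} band for a single new observation; wider than the mean band *)
singleBands = fit["SinglePredictionBands", ConfidenceLevel -> 0.95];

Show[
 ListPlot[data, PlotStyle -> Black],
 Plot[Evaluate[singleBands], {x, 0, 10}, PlotStyle -> None,
  FillingStyle -> {Gray, Opacity@.2}, Filling -> {1 -> {2}}],
 Plot[Evaluate[meanBands], {x, 0, 10}, PlotStyle -> None,
  FillingStyle -> {Blue, Opacity@.5}, Filling -> {1 -> {2}}],
 Plot[fit[x], {x, 0, 10}]]
```

The blue region here corresponds to the shading in the flights plot: it describes the uncertainty in the fitted mean curve, not the spread of the individual points, which is why most bubbles fall outside it.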
