Message Boards Message Boards

9
|
15015 Views
|
5 Replies
|
14 Total Likes
View groups...
Share
Share this post:

BinListsBy - Binning data with associate data

POSTED BY: Sander Huisman
5 Replies

enter image description here - Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

POSTED BY: EDITORIAL BOARD

I've wanted something like this for a long time too. It has been a while since I've needed it, but if I recall correctly, the application was a sliding window of fixed time-width across irregularly sampled data.

However, I think recent improvements in the TimeSeries subsystem have addressed my specific need.

POSTED BY: Vincent Virgilio

I've used quite often actually, and coded the same thing in multiple ways over the years... It can also be implemented by a GatherBy route; you use GatherBy to do the hard work, and then make bins yourself and replace the values....

POSTED BY: Sander Huisman

Again IIRC, performance was an issue. I was trying to apply something like BinListsBy to very long lists. My implementation was never as performant as desired.

Have you done any performance assessments on BinListsBy?

POSTED BY: Vincent Virgilio

Hi Vincent,

Good point, the last version was not optimized for speed. I changed the code slightly above; the old code is included as a comment... This new code is quite a bit faster than the previous one:

ClearAll[BinListsByOld]
BinListsByOld[data_List,binspecs__List]:=Module[{fs,idata,len,out},
    len=Length[data];
    fs={binspecs};
    If[AllTrue[fs,2<=Length[#]<=4&],
        idata=Table[Map[f,data],{f,fs[[All,1]]}];
        AppendTo[idata,Range[len]];
        out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
        out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
        out/.Thread[Range[len]->data]
    ,
        Print["Your specification is incorrect\[Ellipsis]"];
    ]
]
ClearAll[BinListsBy]
BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out},
    len=Length[data];
    fs={binspecs};
    If[AllTrue[fs,2<=Length[#]<=4&],
        idata=Table[Map[f,data],{f,fs[[All,1]]}];
        AppendTo[idata,Range[len]];
        out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
        out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
        Map[data[[#]]&,out,{Length[fs]}]
    ,
        Print["Your specification is incorrect\[Ellipsis]"];
    ]
]
ClearAll[RandomString]
RandomString[chars_List,n_Integer]:=StringJoin[RandomChoice[chars,n]]
RandomString[n_Integer]:=RandomString[CharacterRange["a","z"],n]

Now doing the benchmark:

timings=Table[
data=Transpose[{RandomInteger[100,n],Table[RandomString[RandomInteger[{5,10}]],{n}]}];
{{n,AbsoluteTiming[out=BinListsBy[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[out=BinListsByOld[data,{First,0,100,10}];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}];

Show[{ListLogLogPlot[Transpose[timings], Frame -> True, 
   FrameLabel -> {"size", "timing"}, 
   PlotLegends -> {"BinListsBy New", "BinListsBy Old"}], 
  LogLogPlot[{0.000003 x, 0.00000003 x^2}, {x, 10, 10^5}]}]

enter image description here

You can see that the new code performs much better actually, moreover, it scales much better: linearly versus quadratically!

In order to compare it with 'regular' binning, we can do a problem that can also be done by regular BinLists:

timings2=Table[
data=Transpose[{RandomInteger[100,n],RandomReal[100,n]}];
{{n,AbsoluteTiming[BinListsBy[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[BinListsByOld[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[BinLists[data,{0,100,10},{-1000,1000,2000}][[All,1]];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}];

Show[{ListLogLogPlot[Transpose[timings2], Frame -> True, 
   FrameLabel -> {"size", "timing"}, 
   PlotLegends -> {"BinListsBy New", "BinListsBy Old", "BinLists"}], 
  LogLogPlot[{0.000003 x, 0.00000003 x^2, 0.0000017 x}, {x, 10, 
    10^5}]}]

enter image description here

You can see that both the regular BinLists and the new BinListsBy are very close, and both scale linearly (as expected). It is slightly slower than BinLists, but that is expected because it uses BinLists internally + some extra stuff.

POSTED BY: Sander Huisman
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract