Hi Vincent,
Good point, the last version was not optimized for speed. I changed the code slightly above; the old code is included as a comment... This new code is quite a bit faster than the previous one:
ClearAll[BinListsByOld]
BinListsByOld[data_List,binspecs__List]:=Module[{fs,idata,len,out},
len=Length[data];
fs={binspecs};
If[AllTrue[fs,2<=Length[#]<=4&],
idata=Table[Map[f,data],{f,fs[[All,1]]}];
AppendTo[idata,Range[len]];
out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
out/.Thread[Range[len]->data]
,
Print["Your specification is incorrect\[Ellipsis]"];
]
]
ClearAll[BinListsBy]
BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out},
len=Length[data];
fs={binspecs};
If[AllTrue[fs,2<=Length[#]<=4&],
idata=Table[Map[f,data],{f,fs[[All,1]]}];
AppendTo[idata,Range[len]];
out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
Map[data[[#]]&,out,{Length[fs]}]
,
Print["Your specification is incorrect\[Ellipsis]"];
]
]
ClearAll[RandomString]
RandomString[chars_List,n_Integer]:=StringJoin[RandomChoice[chars,n]]
RandomString[n_Integer]:=RandomString[CharacterRange["a","z"],n]
Now doing the benchmark:
timings=Table[
data=Transpose[{RandomInteger[100,n],Table[RandomString[RandomInteger[{5,10}]],{n}]}];
{{n,AbsoluteTiming[out=BinListsBy[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[out=BinListsByOld[data,{First,0,100,10}];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}];
Show[{ListLogLogPlot[Transpose[timings], Frame -> True,
FrameLabel -> {"size", "timing"},
PlotLegends -> {"BinListsBy New", "BinListsBy Old"}],
LogLogPlot[{0.000003 x, 0.00000003 x^2}, {x, 10, 10^5}]}]
You can see that the new code performs much better actually, moreover, it scales much better: linearly versus quadratically!
In order to compare it with 'regular' binning, we can do a problem that can also be done by regular BinLists:
timings2=Table[
data=Transpose[{RandomInteger[100,n],RandomReal[100,n]}];
{{n,AbsoluteTiming[BinListsBy[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[BinListsByOld[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[BinLists[data,{0,100,10},{-1000,1000,2000}][[All,1]];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}];
Show[{ListLogLogPlot[Transpose[timings2], Frame -> True,
FrameLabel -> {"size", "timing"},
PlotLegends -> {"BinListsBy New", "BinListsBy Old", "BinLists"}],
LogLogPlot[{0.000003 x, 0.00000003 x^2, 0.0000017 x}, {x, 10,
10^5}]}]
You can see that both the regular BinLists and the new BinListsBy are very close, and both scale linearly (as expected). It is slightly slower than BinLists, but that is expected because it uses BinLists internally + some extra stuff.