# BinListsBy - Binning data with associate data

Posted 4 years ago
8258 Views
|
5 Replies
|
13 Total Likes
|

Note that I changed the code a bit on May 8th, 2017. Now the function is always mapped--independent of the dimensions of the input data.

A common problem is that you want to bin data by (e.g.) the x coordinate, while maintaining some associated data y with it. We can define a new function BinListsBy to bin n-dimensional data in m dimensions:

ClearAll[BinListsBy]
BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out},
len=Length[data];
fs={binspecs};
If[AllTrue[fs, MatchQ[#, {_, _?NumericQ, _?NumericQ} | {_, _?NumericQ, _?NumericQ, _?NumericQ}] &],
idata=Table[Map[f,data],{f,fs[[All,1]]}];
AppendTo[idata,Range[len]];
out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
Map[data[[#]] &, out, {Length[fs]}]
,
]
]

?BinListsBy


## Basic Examples

Bin 2D data in only 1 dimension:

data={{29,48},{76,70},{37,78},{60,63},{83,0},{42,44},{26,77},{59,91},{93,46},{38,24},{30,97},{76,60},{98,50},{35,2},{22,17},{90,90},{90,67},{34,22},{97,26},{78,85},{70,55},{59,92},{15,66},{30,84},{45,48},{91,13},{69,94},{10,100},{97,40},{91,87},{74,24},{66,91},{9,93},{67,2},{44,22},{54,96},{67,13},{23,12},{88,2},{35,82},{76,64},{84,32},{62,5},{56,84},{88,68},{25,99},{41,13},{23,4},{40,14},{89,90}};
out=BinListsBy[data,{First,0,100,10}]
out//Grid


The data can be of any type:

data={{1.2,"This"},{5.1,"is"},{4.5,"just"},{3.1,"some"},{2.4,"test"},{7.1,"data"},{6.2,"from"},{3.14,"Sander"},{-0.7,"ok?"}};
BinListsBy[data,{First,-3,10}]
Column[%, Frame -> All]


will output:

{{},{},{{-0.7,ok?}},{},{{1.2,This}},{{2.4,test}},{{3.1,some},{3.14,Sander}},{{4.5,just}},{{5.1,is}},{{6.2,from}},{{7.1,data}},{},{}}


such that it is binned using the first element, and the associate data is kept.

Bin based on the length of a string:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{StringLength,1,10}]


giving:

{{},{is},{ok?},{This,just,some,test,data,from},{},{Sander},{},{},{}}


Of course we can also bin 2D data in a single dimension based on the length of a string:

data={{1.2,"This"},{5.1,"is"},{4.5,"just"},{3.1,"some"},{2.4,"test"},{7.1,"data"},{6.2,"from"},{3.14,"Sander"},{-0.7,"ok?"}};
BinListsBy[data,{StringLength@*Last,1,10}]
Column[%,Frame->All]


giving:

{{},{{5.1,is}},{{-0.7,ok?}},{{1.2,This},{4.5,just},{3.1,some},{2.4,test},{7.1,data},{6.2,from}},{},{{3.14,Sander}},{},{},{}}


where now strings of equal length are binned using a {1,10} specification.

## Scope

We can also do higher number of dimensions:

data={{1.2,2.3,"This"},{5.1,1.2 ,"is"},{4.5,3.1,"just"},{3.1,1.87,"some"},{2.4,1.6,"test"},{7.1,1.4,"data"},{6.2,7.3,"from"},{3.14,9.1,"Sander"},{-0.7,0.8,"ok?"}};
BinListsBy[data,{First,-2,8},{#[[2]]&,0,10},{StringLength@*Last,1,7,1}];
Grid[Map[Column,%,{2}],Frame->All]


which nicely outputs a 3D list of binned data.

Using Composition and RightComposition or pure functions can make elaborate functionality. Here we bin based on number of unique characters:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{Length@*DeleteDuplicates@*Characters@*ToLowerCase,1,10}]


giving:

{{},{is},{test,data,ok?},{This,just,some,from},{},{Sander},{},{},{}}


Note that the data can have different lengths and types or structures:

data={{1.3,{1,1}},{2.3,{1,1,1}},{4.14,1,2,"test"},{2.5,{2,1},2},{4.3}};
BinListsBy[data,{First,0,6}];
Column[%]


Or by a criteria that works based on aggregate functions (Total, Mean):

data={{1,2},{4,5},{-2,5},{3,6},{2,3},{1,1}}
BinListsBy[data,{Total,1,10}];
Column[%]


## Properties & Relations

If the data is purely numeric, real, and not ragged, one can specify very large bins for BinLists to capture all the data in the other dimension:

data={{29,48},{76,70},{37,78},{60,63},{83,0},{42,44},{26,77},{59,91},{93,46},{38,24},{30,97},{76,60},{98,50},{35,2},{22,17},{90,90},{90,67},{34,22},{97,26},{78,85},{70,55},{59,92},{15,66},{30,84},{45,48},{91,13},{69,94},{10,100},{97,40},{91,87},{74,24},{66,91},{9,93},{67,2},{44,22},{54,96},{67,13},{23,12},{88,2},{35,82},{76,64},{84,32},{62,5},{56,84},{88,68},{25,99},{41,13},{23,4},{40,14},{89,90}};
BinLists[data,{0,100,10},{-10000,10000,20000}][[All,1]]===BinListsBy[data,{First,0,100,10}]


## Neat Examples

One can bin 1D data in 2D:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{Length@*DeleteDuplicates@*Characters@*ToLowerCase,1,10},{StringLength,0,10}]//Grid


So we first bin by unique characters, and then by string length:

## Final thoughts

I've been advocating for this functionality for a while, it is basically BinLists but with an extra By like GatherBy and SortBy such that nD data can be binned in mD (which is currently not possible!). It is indeed very useful for data processing where it is very common to bin by a certain value, while wanting to maintain some auxiliary data (for example an ID of the data, or time-stamp or ...).

5 Replies
Sort By:
Posted 4 years ago
 I've wanted something like this for a long time too. It has been a while since I've needed it, but if I recall correctly, the application was a sliding window of fixed time-width across irregularly sampled data.However, I think recent improvements in the TimeSeries subsystem have addressed my specific need.
Posted 4 years ago
 I've used quite often actually, and coded the same thing in multiple ways over the years... It can also be implemented by a GatherBy route; you use GatherBy to do the hard work, and then make bins yourself and replace the values....
Posted 4 years ago
 Again IIRC, performance was an issue. I was trying to apply something like BinListsBy to very long lists. My implementation was never as performant as desired.Have you done any performance assessments on BinListsBy?
 Hi Vincent,Good point, the last version was not optimized for speed. I changed the code slightly above; the old code is included as a comment... This new code is quite a bit faster than the previous one: ClearAll[BinListsByOld] BinListsByOld[data_List,binspecs__List]:=Module[{fs,idata,len,out}, len=Length[data]; fs={binspecs}; If[AllTrue[fs,2<=Length[#]<=4&], idata=Table[Map[f,data],{f,fs[[All,1]]}]; AppendTo[idata,Range[len]]; out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}]; out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1]; out/.Thread[Range[len]->data] , Print["Your specification is incorrect\[Ellipsis]"]; ] ] ClearAll[BinListsBy] BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out}, len=Length[data]; fs={binspecs}; If[AllTrue[fs,2<=Length[#]<=4&], idata=Table[Map[f,data],{f,fs[[All,1]]}]; AppendTo[idata,Range[len]]; out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}]; out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1]; Map[data[[#]]&,out,{Length[fs]}] , Print["Your specification is incorrect\[Ellipsis]"]; ] ] ClearAll[RandomString] RandomString[chars_List,n_Integer]:=StringJoin[RandomChoice[chars,n]] RandomString[n_Integer]:=RandomString[CharacterRange["a","z"],n] Now doing the benchmark: timings=Table[ data=Transpose[{RandomInteger[100,n],Table[RandomString[RandomInteger[{5,10}]],{n}]}]; {{n,AbsoluteTiming[out=BinListsBy[data,{First,0,100,10}];][[1]]}, {n,AbsoluteTiming[out=BinListsByOld[data,{First,0,100,10}];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}]; Show[{ListLogLogPlot[Transpose[timings], Frame -> True, FrameLabel -> {"size", "timing"}, PlotLegends -> {"BinListsBy New", "BinListsBy Old"}], LogLogPlot[{0.000003 x, 0.00000003 x^2}, {x, 10, 10^5}]}] You can see that the new code performs much better actually, moreover, it scales much better: linearly versus quadratically!In order to compare it with 'regular' binning, we can do a problem that can also be done by regular BinLists: timings2=Table[ data=Transpose[{RandomInteger[100,n],RandomReal[100,n]}]; {{n,AbsoluteTiming[BinListsBy[data,{First,0,100,10}];][[1]]}, {n,AbsoluteTiming[BinListsByOld[data,{First,0,100,10}];][[1]]}, {n,AbsoluteTiming[BinLists[data,{0,100,10},{-1000,1000,2000}][[All,1]];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}]; Show[{ListLogLogPlot[Transpose[timings2], Frame -> True, FrameLabel -> {"size", "timing"}, PlotLegends -> {"BinListsBy New", "BinListsBy Old", "BinLists"}], LogLogPlot[{0.000003 x, 0.00000003 x^2, 0.0000017 x}, {x, 10, 10^5}]}] You can see that both the regular BinLists and the new BinListsBy are very close, and both scale linearly (as expected). It is slightly slower than BinLists, but that is expected because it uses BinLists internally + some extra stuff.