Message Boards Message Boards

9
|
15127 Views
|
5 Replies
|
14 Total Likes
View groups...
Share
Share this post:

BinListsBy - Binning data with associate data

Note that I changed the code a bit on May 8th, 2017. Now the function is always mapped--independent of the dimensions of the input data.

A common problem is that you want to bin data by (e.g.) the x coordinate, while maintaining some associated data y with it. We can define a new function BinListsBy to bin n-dimensional data in m dimensions:

ClearAll[BinListsBy]
BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out},
    len=Length[data];
    fs={binspecs};
    If[AllTrue[fs, MatchQ[#, {_, _?NumericQ, _?NumericQ} | {_, _?NumericQ, _?NumericQ, _?NumericQ}] &],
        idata=Table[Map[f,data],{f,fs[[All,1]]}];
        AppendTo[idata,Range[len]];
        out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
        out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
        Map[data[[#]] &, out, {Length[fs]}]
    ,
        Print["Your specification is incorrect\[Ellipsis]"];
    ]
]        
BinListsBy::usage=Evaluate[Uncompress@"1:eJztlctqwkAUhn2DvsLpPoUYobelha66qkt1McmcmFPMTHDGSxBfuE/RXKiSRCSOgyi6mTAXPv6f85+cR19+D34fOp0+iS9SWvXT4Xr0/vny0S3Wt2J9Hcx9Fcwo0X25GpYXq+LjOlBuy+fuuHJ6JMOzwOg1GJ7rPW8cWIctgVQFxiRMZNUpbFWhbMbgk1BGFqlCAhKg5kGAStECs53GCc5Kuo96iSis+AYmuBXr4DOFHKQAHSEs2HSOCjiGJLJTP4WYJQmJCYSgJSALIpBh/jQeuV3vHtOzxtQBTnajWqCygnKCJXEd3TN6fRltBVMJBhZ8NTFNa7ksMOZVIzQG8cQpRqFICjbN8yrySmf1tdAA21j9B6gVMzzIbKus6bwGCuWszPLWfz3N54tzeoixcYyEGPVEeohhKKTZWMcL6dWEXMEAsZEeC+G5wSl0IX27GyEX0r8nCrLfxztBtz5sjQryY6EgTUZRiL29mOm8xEGe/Q1ykZxptmeo/wEBvbZm"];

?BinListsBy

enter image description here

Basic Examples

Bin 2D data in only 1 dimension:

data={{29,48},{76,70},{37,78},{60,63},{83,0},{42,44},{26,77},{59,91},{93,46},{38,24},{30,97},{76,60},{98,50},{35,2},{22,17},{90,90},{90,67},{34,22},{97,26},{78,85},{70,55},{59,92},{15,66},{30,84},{45,48},{91,13},{69,94},{10,100},{97,40},{91,87},{74,24},{66,91},{9,93},{67,2},{44,22},{54,96},{67,13},{23,12},{88,2},{35,82},{76,64},{84,32},{62,5},{56,84},{88,68},{25,99},{41,13},{23,4},{40,14},{89,90}};
out=BinListsBy[data,{First,0,100,10}]
out//Grid

enter image description here

The data can be of any type:

data={{1.2,"This"},{5.1,"is"},{4.5,"just"},{3.1,"some"},{2.4,"test"},{7.1,"data"},{6.2,"from"},{3.14,"Sander"},{-0.7,"ok?"}};
BinListsBy[data,{First,-3,10}]
Column[%, Frame -> All]

will output:

{{},{},{{-0.7,ok?}},{},{{1.2,This}},{{2.4,test}},{{3.1,some},{3.14,Sander}},{{4.5,just}},{{5.1,is}},{{6.2,from}},{{7.1,data}},{},{}}

enter image description here

such that it is binned using the first element, and the associate data is kept.

Bin based on the length of a string:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{StringLength,1,10}]

giving:

{{},{is},{ok?},{This,just,some,test,data,from},{},{Sander},{},{},{}}

Of course we can also bin 2D data in a single dimension based on the length of a string:

data={{1.2,"This"},{5.1,"is"},{4.5,"just"},{3.1,"some"},{2.4,"test"},{7.1,"data"},{6.2,"from"},{3.14,"Sander"},{-0.7,"ok?"}};
BinListsBy[data,{StringLength@*Last,1,10}]
Column[%,Frame->All]

giving:

{{},{{5.1,is}},{{-0.7,ok?}},{{1.2,This},{4.5,just},{3.1,some},{2.4,test},{7.1,data},{6.2,from}},{},{{3.14,Sander}},{},{},{}}

enter image description here

where now strings of equal length are binned using a {1,10} specification.

Scope

We can also do higher number of dimensions:

data={{1.2,2.3,"This"},{5.1,1.2 ,"is"},{4.5,3.1,"just"},{3.1,1.87,"some"},{2.4,1.6,"test"},{7.1,1.4,"data"},{6.2,7.3,"from"},{3.14,9.1,"Sander"},{-0.7,0.8,"ok?"}};
BinListsBy[data,{First,-2,8},{#[[2]]&,0,10},{StringLength@*Last,1,7,1}];
Grid[Map[Column,%,{2}],Frame->All]

enter image description here

which nicely outputs a 3D list of binned data.

Using Composition and RightComposition or pure functions can make elaborate functionality. Here we bin based on number of unique characters:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{Length@*DeleteDuplicates@*Characters@*ToLowerCase,1,10}]

giving:

{{},{is},{test,data,ok?},{This,just,some,from},{},{Sander},{},{},{}}

Note that the data can have different lengths and types or structures:

data={{1.3,{1,1}},{2.3,{1,1,1}},{4.14,1,2,"test"},{2.5,{2,1},2},{4.3}};
BinListsBy[data,{First,0,6}];
Column[%]

enter image description here

Or by a criteria that works based on aggregate functions (Total, Mean):

data={{1,2},{4,5},{-2,5},{3,6},{2,3},{1,1}}
BinListsBy[data,{Total,1,10}];
Column[%]

enter image description here

Properties & Relations

If the data is purely numeric, real, and not ragged, one can specify very large bins for BinLists to capture all the data in the other dimension:

data={{29,48},{76,70},{37,78},{60,63},{83,0},{42,44},{26,77},{59,91},{93,46},{38,24},{30,97},{76,60},{98,50},{35,2},{22,17},{90,90},{90,67},{34,22},{97,26},{78,85},{70,55},{59,92},{15,66},{30,84},{45,48},{91,13},{69,94},{10,100},{97,40},{91,87},{74,24},{66,91},{9,93},{67,2},{44,22},{54,96},{67,13},{23,12},{88,2},{35,82},{76,64},{84,32},{62,5},{56,84},{88,68},{25,99},{41,13},{23,4},{40,14},{89,90}};
BinLists[data,{0,100,10},{-10000,10000,20000}][[All,1]]===BinListsBy[data,{First,0,100,10}]

Neat Examples

One can bin 1D data in 2D:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{Length@*DeleteDuplicates@*Characters@*ToLowerCase,1,10},{StringLength,0,10}]//Grid

So we first bin by unique characters, and then by string length:

enter image description here

Final thoughts

I've been advocating for this functionality for a while, it is basically BinLists but with an extra By like GatherBy and SortBy such that nD data can be binned in mD (which is currently not possible!). It is indeed very useful for data processing where it is very common to bin by a certain value, while wanting to maintain some auxiliary data (for example an ID of the data, or time-stamp or ...).

POSTED BY: Sander Huisman
5 Replies

enter image description here - Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

POSTED BY: EDITORIAL BOARD

I've wanted something like this for a long time too. It has been a while since I've needed it, but if I recall correctly, the application was a sliding window of fixed time-width across irregularly sampled data.

However, I think recent improvements in the TimeSeries subsystem have addressed my specific need.

POSTED BY: Vincent Virgilio

I've used quite often actually, and coded the same thing in multiple ways over the years... It can also be implemented by a GatherBy route; you use GatherBy to do the hard work, and then make bins yourself and replace the values....

POSTED BY: Sander Huisman

Again IIRC, performance was an issue. I was trying to apply something like BinListsBy to very long lists. My implementation was never as performant as desired.

Have you done any performance assessments on BinListsBy?

POSTED BY: Vincent Virgilio

Hi Vincent,

Good point, the last version was not optimized for speed. I changed the code slightly above; the old code is included as a comment... This new code is quite a bit faster than the previous one:

ClearAll[BinListsByOld]
BinListsByOld[data_List,binspecs__List]:=Module[{fs,idata,len,out},
    len=Length[data];
    fs={binspecs};
    If[AllTrue[fs,2<=Length[#]<=4&],
        idata=Table[Map[f,data],{f,fs[[All,1]]}];
        AppendTo[idata,Range[len]];
        out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
        out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
        out/.Thread[Range[len]->data]
    ,
        Print["Your specification is incorrect\[Ellipsis]"];
    ]
]
ClearAll[BinListsBy]
BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out},
    len=Length[data];
    fs={binspecs};
    If[AllTrue[fs,2<=Length[#]<=4&],
        idata=Table[Map[f,data],{f,fs[[All,1]]}];
        AppendTo[idata,Range[len]];
        out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
        out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
        Map[data[[#]]&,out,{Length[fs]}]
    ,
        Print["Your specification is incorrect\[Ellipsis]"];
    ]
]
ClearAll[RandomString]
RandomString[chars_List,n_Integer]:=StringJoin[RandomChoice[chars,n]]
RandomString[n_Integer]:=RandomString[CharacterRange["a","z"],n]

Now doing the benchmark:

timings=Table[
data=Transpose[{RandomInteger[100,n],Table[RandomString[RandomInteger[{5,10}]],{n}]}];
{{n,AbsoluteTiming[out=BinListsBy[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[out=BinListsByOld[data,{First,0,100,10}];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}];

Show[{ListLogLogPlot[Transpose[timings], Frame -> True, 
   FrameLabel -> {"size", "timing"}, 
   PlotLegends -> {"BinListsBy New", "BinListsBy Old"}], 
  LogLogPlot[{0.000003 x, 0.00000003 x^2}, {x, 10, 10^5}]}]

enter image description here

You can see that the new code performs much better actually, moreover, it scales much better: linearly versus quadratically!

In order to compare it with 'regular' binning, we can do a problem that can also be done by regular BinLists:

timings2=Table[
data=Transpose[{RandomInteger[100,n],RandomReal[100,n]}];
{{n,AbsoluteTiming[BinListsBy[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[BinListsByOld[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[BinLists[data,{0,100,10},{-1000,1000,2000}][[All,1]];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}];

Show[{ListLogLogPlot[Transpose[timings2], Frame -> True, 
   FrameLabel -> {"size", "timing"}, 
   PlotLegends -> {"BinListsBy New", "BinListsBy Old", "BinLists"}], 
  LogLogPlot[{0.000003 x, 0.00000003 x^2, 0.0000017 x}, {x, 10, 
    10^5}]}]

enter image description here

You can see that both the regular BinLists and the new BinListsBy are very close, and both scale linearly (as expected). It is slightly slower than BinLists, but that is expected because it uses BinLists internally + some extra stuff.

POSTED BY: Sander Huisman
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract