Group Abstract

Message Boards

WOLFRAM COMMUNITY

16.7K Views

5 Replies

14 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

BinListsBy - Binning data with associate data

Sander Huisman

Sander Huisman, University of Twente

Posted 8 years ago

Note that I changed the code a bit on May 8th, 2017. Now the function is always mapped--independent of the dimensions of the input data. A common problem is that you want to bin data by (e.g.) the x coordinate, while maintaining some associated data y with it. We can define a new function BinListsBy to bin n-dimensional data in m dimensions: ClearAll[BinListsBy] BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out}, len=Length[data]; fs={binspecs}; If[AllTrue[fs, MatchQ[#, {_, _?NumericQ, _?NumericQ} \| {_, _?NumericQ, _?NumericQ, _?NumericQ}] &], idata=Table[Map[f,data],{f,fs[[All,1]]}]; AppendTo[idata,Range[len]]; out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}]; out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1]; Map[data[[#]] &, out, {Length[fs]}] , Print["Your specification is incorrect\[Ellipsis]"]; ] ] BinListsBy::usage=Evaluate[Uncompress@"1:eJztlctqwkAUhn2DvsLpPoUYobelha66qkt1McmcmFPMTHDGSxBfuE/RXKiSRCSOgyi6mTAXPv6f85+cR19+D34fOp0+iS9SWvXT4Xr0/vny0S3Wt2J9Hcx9Fcwo0X25GpYXq+LjOlBuy+fuuHJ6JMOzwOg1GJ7rPW8cWIctgVQFxiRMZNUpbFWhbMbgk1BGFqlCAhKg5kGAStECs53GCc5Kuo96iSis+AYmuBXr4DOFHKQAHSEs2HSOCjiGJLJTP4WYJQmJCYSgJSALIpBh/jQeuV3vHtOzxtQBTnajWqCygnKCJXEd3TN6fRltBVMJBhZ8NTFNa7ksMOZVIzQG8cQpRqFICjbN8yrySmf1tdAA21j9B6gVMzzIbKus6bwGCuWszPLWfz3N54tzeoixcYyEGPVEeohhKKTZWMcL6dWEXMEAsZEeC+G5wSl0IX27GyEX0r8nCrLfxztBtz5sjQryY6EgTUZRiL29mOm8xEGe/Q1ykZxptmeo/wEBvbZm"]; ?BinListsBy Basic Examples Bin 2D data in only 1 dimension: data={{29,48},{76,70},{37,78},{60,63},{83,0},{42,44},{26,77},{59,91},{93,46},{38,24},{30,97},{76,60},{98,50},{35,2},{22,17},{90,90},{90,67},{34,22},{97,26},{78,85},{70,55},{59,92},{15,66},{30,84},{45,48},{91,13},{69,94},{10,100},{97,40},{91,87},{74,24},{66,91},{9,93},{67,2},{44,22},{54,96},{67,13},{23,12},{88,2},{35,82},{76,64},{84,32},{62,5},{56,84},{88,68},{25,99},{41,13},{23,4},{40,14},{89,90}}; out=BinListsBy[data,{First,0,100,10}] out//Grid The data can be of any type: data={{1.2,"This"},{5.1,"is"},{4.5,"just"},{3.1,"some"},{2.4,"test"},{7.1,"data"},{6.2,"from"},{3.14,"Sander"},{-0.7,"ok?"}}; BinListsBy[data,{First,-3,10}] Column[%, Frame -> All] will output: {{},{},{{-0.7,ok?}},{},{{1.2,This}},{{2.4,test}},{{3.1,some},{3.14,Sander}},{{4.5,just}},{{5.1,is}},{{6.2,from}},{{7.1,data}},{},{}} such that it is binned using the first element, and the associate data is kept. Bin based on the length of a string: data={"This","is","just","some","test","data","from","Sander","ok?"}; BinListsBy[data,{StringLength,1,10}] giving: {{},{is},{ok?},{This,just,some,test,data,from},{},{Sander},{},{},{}} Of course we can also bin 2D data in a single dimension based on the length of a string: data={{1.2,"This"},{5.1,"is"},{4.5,"just"},{3.1,"some"},{2.4,"test"},{7.1,"data"},{6.2,"from"},{3.14,"Sander"},{-0.7,"ok?"}}; BinListsBy[data,{StringLength@Last,1,10}] Column[%,Frame->All] giving: {{},{{5.1,is}},{{-0.7,ok?}},{{1.2,This},{4.5,just},{3.1,some},{2.4,test},{7.1,data},{6.2,from}},{},{{3.14,Sander}},{},{},{}} where now strings of equal length are binned using a {1,10} specification. Scope We can also do higher number of dimensions: data={{1.2,2.3,"This"},{5.1,1.2 ,"is"},{4.5,3.1,"just"},{3.1,1.87,"some"},{2.4,1.6,"test"},{7.1,1.4,"data"},{6.2,7.3,"from"},{3.14,9.1,"Sander"},{-0.7,0.8,"ok?"}}; BinListsBy[data,{First,-2,8},{#[[2]]&,0,10},{StringLength@Last,1,7,1}]; Grid[Map[Column,%,{2}],Frame->All] which nicely outputs a 3D list of binned data. Using Composition and RightComposition or pure functions can make elaborate functionality. Here we bin based on number of unique characters: data={"This","is","just","some","test","data","from","Sander","ok?"}; BinListsBy[data,{Length@DeleteDuplicates@Characters@ToLowerCase,1,10}] giving: {{},{is},{test,data,ok?},{This,just,some,from},{},{Sander},{},{},{}} Note that the data can have different lengths and types or structures: data={{1.3,{1,1}},{2.3,{1,1,1}},{4.14,1,2,"test"},{2.5,{2,1},2},{4.3}}; BinListsBy[data,{First,0,6}]; Column[%] Or by a criteria that works based on aggregate functions (Total, Mean): data={{1,2},{4,5},{-2,5},{3,6},{2,3},{1,1}} BinListsBy[data,{Total,1,10}]; Column[%] Properties & Relations If the data is purely numeric, real, and not ragged, one can specify very large bins for BinLists* to capture all the data in the other dimension: data={{29,48},{76,70},{37,78},{60,63},{83,0},{42,44},{26,77},{59,91},{93,46},{38,24},{30,97},{76,60},{98,50},{35,2},{22,17},{90,90},{90,67},{34,22},{97,26},{78,85},{70,55},{59,92},{15,66},{30,84},{45,48},{91,13},{69,94},{10,100},{97,40},{91,87},{74,24},{66,91},{9,93},{67,2},{44,22},{54,96},{67,13},{23,12},{88,2},{35,82},{76,64},{84,32},{62,5},{56,84},{88,68},{25,99},{41,13},{23,4},{40,14},{89,90}}; BinLists[data,{0,100,10},{-10000,10000,20000}][[All,1]]===BinListsBy[data,{First,0,100,10}] Neat Examples One can bin 1D data in 2D: data={"This","is","just","some","test","data","from","Sander","ok?"}; BinListsBy[data,{Length@DeleteDuplicates@Characters@ToLowerCase,1,10},{StringLength,0,10}]//Grid So we first bin by unique characters, and then by string length: Final thoughts I've been advocating for this functionality for a while, it is basically BinLists* but with an extra By like GatherBy and SortBy such that nD data can be binned in mD (which is currently not possible!). It is indeed very useful for data processing where it is very common to bin by a certain value, while wanting to maintain some auxiliary data (for example an ID of the data, or time-stamp or ...).

Note that I changed the code a bit on May 8th, 2017. Now the function is always mapped--independent of the dimensions of the input data.

A common problem is that you want to bin data by (e.g.) the x coordinate, while maintaining some associated data y with it. We can define a new function BinListsBy to bin n-dimensional data in m dimensions:

ClearAll[BinListsBy]
BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out},
    len=Length[data];
    fs={binspecs};
    If[AllTrue[fs, MatchQ[#, {_, _?NumericQ, _?NumericQ} | {_, _?NumericQ, _?NumericQ, _?NumericQ}] &],
        idata=Table[Map[f,data],{f,fs[[All,1]]}];
        AppendTo[idata,Range[len]];
        out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
        out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
        Map[data[[#]] &, out, {Length[fs]}]
    ,
        Print["Your specification is incorrect\[Ellipsis]"];
    ]
]        
BinListsBy::usage=Evaluate[Uncompress@"1:eJztlctqwkAUhn2DvsLpPoUYobelha66qkt1McmcmFPMTHDGSxBfuE/RXKiSRCSOgyi6mTAXPv6f85+cR19+D34fOp0+iS9SWvXT4Xr0/vny0S3Wt2J9Hcx9Fcwo0X25GpYXq+LjOlBuy+fuuHJ6JMOzwOg1GJ7rPW8cWIctgVQFxiRMZNUpbFWhbMbgk1BGFqlCAhKg5kGAStECs53GCc5Kuo96iSis+AYmuBXr4DOFHKQAHSEs2HSOCjiGJLJTP4WYJQmJCYSgJSALIpBh/jQeuV3vHtOzxtQBTnajWqCygnKCJXEd3TN6fRltBVMJBhZ8NTFNa7ksMOZVIzQG8cQpRqFICjbN8yrySmf1tdAA21j9B6gVMzzIbKus6bwGCuWszPLWfz3N54tzeoixcYyEGPVEeohhKKTZWMcL6dWEXMEAsZEeC+G5wSl0IX27GyEX0r8nCrLfxztBtz5sjQryY6EgTUZRiL29mOm8xEGe/Q1ykZxptmeo/wEBvbZm"];

?BinListsBy

enter image description here

Basic Examples

Bin 2D data in only 1 dimension:

data={{29,48},{76,70},{37,78},{60,63},{83,0},{42,44},{26,77},{59,91},{93,46},{38,24},{30,97},{76,60},{98,50},{35,2},{22,17},{90,90},{90,67},{34,22},{97,26},{78,85},{70,55},{59,92},{15,66},{30,84},{45,48},{91,13},{69,94},{10,100},{97,40},{91,87},{74,24},{66,91},{9,93},{67,2},{44,22},{54,96},{67,13},{23,12},{88,2},{35,82},{76,64},{84,32},{62,5},{56,84},{88,68},{25,99},{41,13},{23,4},{40,14},{89,90}};
out=BinListsBy[data,{First,0,100,10}]
out//Grid

enter image description here

The data can be of any type:

data={{1.2,"This"},{5.1,"is"},{4.5,"just"},{3.1,"some"},{2.4,"test"},{7.1,"data"},{6.2,"from"},{3.14,"Sander"},{-0.7,"ok?"}};
BinListsBy[data,{First,-3,10}]
Column[%, Frame -> All]

will output:

{{},{},{{-0.7,ok?}},{},{{1.2,This}},{{2.4,test}},{{3.1,some},{3.14,Sander}},{{4.5,just}},{{5.1,is}},{{6.2,from}},{{7.1,data}},{},{}}

enter image description here

such that it is binned using the first element, and the associate data is kept.

Bin based on the length of a string:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{StringLength,1,10}]

giving:

{{},{is},{ok?},{This,just,some,test,data,from},{},{Sander},{},{},{}}

Of course we can also bin 2D data in a single dimension based on the length of a string:

data={{1.2,"This"},{5.1,"is"},{4.5,"just"},{3.1,"some"},{2.4,"test"},{7.1,"data"},{6.2,"from"},{3.14,"Sander"},{-0.7,"ok?"}};
BinListsBy[data,{StringLength@*Last,1,10}]
Column[%,Frame->All]

giving:

{{},{{5.1,is}},{{-0.7,ok?}},{{1.2,This},{4.5,just},{3.1,some},{2.4,test},{7.1,data},{6.2,from}},{},{{3.14,Sander}},{},{},{}}

enter image description here

where now strings of equal length are binned using a {1,10} specification.

Scope

We can also do higher number of dimensions:

data={{1.2,2.3,"This"},{5.1,1.2 ,"is"},{4.5,3.1,"just"},{3.1,1.87,"some"},{2.4,1.6,"test"},{7.1,1.4,"data"},{6.2,7.3,"from"},{3.14,9.1,"Sander"},{-0.7,0.8,"ok?"}};
BinListsBy[data,{First,-2,8},{#[[2]]&,0,10},{StringLength@*Last,1,7,1}];
Grid[Map[Column,%,{2}],Frame->All]

enter image description here

which nicely outputs a 3D list of binned data.

Using Composition and RightComposition or pure functions can make elaborate functionality. Here we bin based on number of unique characters:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{Length@*DeleteDuplicates@*Characters@*ToLowerCase,1,10}]

giving:

{{},{is},{test,data,ok?},{This,just,some,from},{},{Sander},{},{},{}}

Note that the data can have different lengths and types or structures:

data={{1.3,{1,1}},{2.3,{1,1,1}},{4.14,1,2,"test"},{2.5,{2,1},2},{4.3}};
BinListsBy[data,{First,0,6}];
Column[%]

enter image description here

Or by a criteria that works based on aggregate functions (Total, Mean):

data={{1,2},{4,5},{-2,5},{3,6},{2,3},{1,1}}
BinListsBy[data,{Total,1,10}];
Column[%]

enter image description here

Properties & Relations

If the data is purely numeric, real, and not ragged, one can specify very large bins for BinLists to capture all the data in the other dimension:

data={{29,48},{76,70},{37,78},{60,63},{83,0},{42,44},{26,77},{59,91},{93,46},{38,24},{30,97},{76,60},{98,50},{35,2},{22,17},{90,90},{90,67},{34,22},{97,26},{78,85},{70,55},{59,92},{15,66},{30,84},{45,48},{91,13},{69,94},{10,100},{97,40},{91,87},{74,24},{66,91},{9,93},{67,2},{44,22},{54,96},{67,13},{23,12},{88,2},{35,82},{76,64},{84,32},{62,5},{56,84},{88,68},{25,99},{41,13},{23,4},{40,14},{89,90}};
BinLists[data,{0,100,10},{-10000,10000,20000}][[All,1]]===BinListsBy[data,{First,0,100,10}]

Neat Examples

One can bin 1D data in 2D:

data={"This","is","just","some","test","data","from","Sander","ok?"};
BinListsBy[data,{Length@*DeleteDuplicates@*Characters@*ToLowerCase,1,10},{StringLength,0,10}]//Grid

So we first bin by unique characters, and then by string length:

enter image description here

Final thoughts

I've been advocating for this functionality for a while, it is basically BinLists but with an extra By like GatherBy and SortBy such that nD data can be binned in mD (which is currently not possible!). It is indeed very useful for data processing where it is very common to bin by a certain value, while wanting to maintain some auxiliary data (for example an ID of the data, or time-stamp or ...).

POSTED BY: Sander Huisman

5 Replies

Sort By:

EDITORIAL BOARD

EDITORIAL BOARD, WOLFRAM

Posted 8 years ago

- Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

POSTED BY: EDITORIAL BOARD

Vincent Virgilio

Posted 8 years ago

I've wanted something like this for a long time too. It has been a while since I've needed it, but if I recall correctly, the application was a sliding window of fixed time-width across irregularly sampled data. However, I think recent improvements in the TimeSeries subsystem have addressed my specific need.

POSTED BY: Vincent Virgilio

Sander Huisman

Sander Huisman, University of Twente

Posted 8 years ago

I've used quite often actually, and coded the same thing in multiple ways over the years... It can also be implemented by a GatherBy route; you use GatherBy to do the hard work, and then make bins yourself and replace the values....

POSTED BY: Sander Huisman

Vincent Virgilio

Posted 8 years ago

Again IIRC, performance was an issue. I was trying to apply something like BinListsBy to very long lists. My implementation was never as performant as desired. Have you done any performance assessments on BinListsBy?

POSTED BY: Vincent Virgilio

Sander Huisman

Sander Huisman, University of Twente

Posted 8 years ago

Hi Vincent, Good point, the last version was not optimized for speed. I changed the code slightly above; the old code is included as a comment... This new code is quite a bit faster than the previous one: ClearAll[BinListsByOld] BinListsByOld[data_List,binspecs__List]:=Module[{fs,idata,len,out}, len=Length[data]; fs={binspecs}; If[AllTrue[fs,2<=Length[#]<=4&], idata=Table[Map[f,data],{f,fs[[All,1]]}]; AppendTo[idata,Range[len]]; out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}]; out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1]; out/.Thread[Range[len]->data] , Print["Your specification is incorrect\[Ellipsis]"]; ] ] ClearAll[BinListsBy] BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out}, len=Length[data]; fs={binspecs}; If[AllTrue[fs,2<=Length[#]<=4&], idata=Table[Map[f,data],{f,fs[[All,1]]}]; AppendTo[idata,Range[len]]; out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}]; out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1]; Map[data[[#]]&,out,{Length[fs]}] , Print["Your specification is incorrect\[Ellipsis]"]; ] ] ClearAll[RandomString] RandomString[chars_List,n_Integer]:=StringJoin[RandomChoice[chars,n]] RandomString[n_Integer]:=RandomString[CharacterRange["a","z"],n] Now doing the benchmark: timings=Table[ data=Transpose[{RandomInteger[100,n],Table[RandomString[RandomInteger[{5,10}]],{n}]}]; {{n,AbsoluteTiming[out=BinListsBy[data,{First,0,100,10}];][[1]]}, {n,AbsoluteTiming[out=BinListsByOld[data,{First,0,100,10}];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}]; Show[{ListLogLogPlot[Transpose[timings], Frame -> True, FrameLabel -> {"size", "timing"}, PlotLegends -> {"BinListsBy New", "BinListsBy Old"}], LogLogPlot[{0.000003 x, 0.00000003 x^2}, {x, 10, 10^5}]}] You can see that the new code performs much better actually, moreover, it scales much better: linearly versus quadratically! In order to compare it with 'regular' binning, we can do a problem that can also be done by regular BinLists: timings2=Table[ data=Transpose[{RandomInteger[100,n],RandomReal[100,n]}]; {{n,AbsoluteTiming[BinListsBy[data,{First,0,100,10}];][[1]]}, {n,AbsoluteTiming[BinListsByOld[data,{First,0,100,10}];][[1]]}, {n,AbsoluteTiming[BinLists[data,{0,100,10},{-1000,1000,2000}][[All,1]];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}]; Show[{ListLogLogPlot[Transpose[timings2], Frame -> True, FrameLabel -> {"size", "timing"}, PlotLegends -> {"BinListsBy New", "BinListsBy Old", "BinLists"}], LogLogPlot[{0.000003 x, 0.00000003 x^2, 0.0000017 x}, {x, 10, 10^5}]}] You can see that both the regular BinLists and the new BinListsBy are very close, and both scale linearly (as expected). It is slightly slower than BinLists, but that is expected because it uses BinLists internally + some extra stuff.

Hi Vincent,

Good point, the last version was not optimized for speed. I changed the code slightly above; the old code is included as a comment... This new code is quite a bit faster than the previous one:

ClearAll[BinListsByOld]
BinListsByOld[data_List,binspecs__List]:=Module[{fs,idata,len,out},
    len=Length[data];
    fs={binspecs};
    If[AllTrue[fs,2<=Length[#]<=4&],
        idata=Table[Map[f,data],{f,fs[[All,1]]}];
        AppendTo[idata,Range[len]];
        out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
        out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
        out/.Thread[Range[len]->data]
    ,
        Print["Your specification is incorrect\[Ellipsis]"];
    ]
]
ClearAll[BinListsBy]
BinListsBy[data_List,binspecs__List]:=Module[{fs,idata,len,out},
    len=Length[data];
    fs={binspecs};
    If[AllTrue[fs,2<=Length[#]<=4&],
        idata=Table[Map[f,data],{f,fs[[All,1]]}];
        AppendTo[idata,Range[len]];
        out=BinLists[Transpose[idata],Sequence@@(Rest/@fs),{0,len+1,len+1}];
        out=Part[out,Sequence@@ConstantArray[All,Length[fs]],1,All,-1];
        Map[data[[#]]&,out,{Length[fs]}]
    ,
        Print["Your specification is incorrect\[Ellipsis]"];
    ]
]
ClearAll[RandomString]
RandomString[chars_List,n_Integer]:=StringJoin[RandomChoice[chars,n]]
RandomString[n_Integer]:=RandomString[CharacterRange["a","z"],n]

Now doing the benchmark:

timings=Table[
data=Transpose[{RandomInteger[100,n],Table[RandomString[RandomInteger[{5,10}]],{n}]}];
{{n,AbsoluteTiming[out=BinListsBy[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[out=BinListsByOld[data,{First,0,100,10}];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}];

Show[{ListLogLogPlot[Transpose[timings], Frame -> True, 
   FrameLabel -> {"size", "timing"}, 
   PlotLegends -> {"BinListsBy New", "BinListsBy Old"}], 
  LogLogPlot[{0.000003 x, 0.00000003 x^2}, {x, 10, 10^5}]}]

enter image description here

You can see that the new code performs much better actually, moreover, it scales much better: linearly versus quadratically!

In order to compare it with 'regular' binning, we can do a problem that can also be done by regular BinLists:

timings2=Table[
data=Transpose[{RandomInteger[100,n],RandomReal[100,n]}];
{{n,AbsoluteTiming[BinListsBy[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[BinListsByOld[data,{First,0,100,10}];][[1]]},
{n,AbsoluteTiming[BinLists[data,{0,100,10},{-1000,1000,2000}][[All,1]];][[1]]}},{n,Round[PowerRange[10,5 10^4,1.2]]}];

Show[{ListLogLogPlot[Transpose[timings2], Frame -> True, 
   FrameLabel -> {"size", "timing"}, 
   PlotLegends -> {"BinListsBy New", "BinListsBy Old", "BinLists"}], 
  LogLogPlot[{0.000003 x, 0.00000003 x^2, 0.0000017 x}, {x, 10, 
    10^5}]}]

enter image description here

You can see that both the regular BinLists and the new BinListsBy are very close, and both scale linearly (as expected). It is slightly slower than BinLists, but that is expected because it uses BinLists internally + some extra stuff.

POSTED BY: Sander Huisman

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback