WOLFRAM COMMUNITY

Datasets are very memory hungry (20x!)

Rodrigo Murta

Rodrigo Murta, Looqbox

Posted 11 years ago

I find it strange that Dataset stores column names redundantly, which rules out a memory-efficient representation of simple tabular data (such as SQL results). Because every column name is repeated in every row, it is not memory efficient. See this test:

compare[lines_, columns_] := Module[{l1, l2},
  l1 = RandomInteger[1000000, {lines, columns}];
  l2 = AssociationThread["Columns" <> ToString@# & /@ Range[columns] -> #] & /@ l1;
  ByteCount[l2]/ByteCount@Dataset[l1] // N
]

compare[100000, 4]

19.9959

I get approximately 20x more memory for it! I know that Dataset is a cool tool for hierarchical data structures. But I believe that 95% of data problems are still rectangular (all SQL query results are), and the current format of Dataset is very memory inefficient for that. Maybe Dataset could have an option TabularData -> True, so:

ds = Dataset[<|"col1" -> {1, 2, 3, 4}, "col2" -> {"a", "b", "c", "d"}|>, TabularData -> True]

Here we can check that this structure does not need more memory:

compare[lines_, columns_] := Module[{l1, l2},
  l1 = RandomInteger[1000000, {lines, columns}];
  l2 = Thread["Columns" <> ToString@# & /@ Range[columns] -> Transpose@l1];
  ByteCount[l2]/ByteCount@Dataset[l1] // N
]

compare[100000, 4]

1.0007

Does Wolfram have any plans to handle tabular data better in Dataset?

POSTED BY: Rodrigo Murta

A Cooper

Posted 3 years ago

Hi, I'm wondering whether Mathematica 13 has an efficient implementation (for instance, if I create a `Dataset` from an Array or PackedArray, does Mathematica isolate the Association from the actual data?). Thanks! Allan

POSTED BY: A Cooper

Stefan Ragnarsson

Stefan Ragnarsson, Wolfram Research

Posted 3 years ago

At the moment in Mathematica 13 the `Dataset` implementation is still less efficient than we'd like, but we are working on it.

POSTED BY: Stefan Ragnarsson

Taliesin Beynon

Taliesin Beynon, Wolfram Research

Posted 11 years ago

Hi, I'm the designer and developer of Dataset. Indeed, "row-oriented" lists of associations are a memory-inefficient way of storing tables. On the other hand, they are the natural way to do it, conceptually, because each row is meaningful on its own, whereas a single column typically isn't. For example, you can Map pure functions that use #foo and #bar and things will just work. And flexible schemas are possible, whereas they are not with "column-oriented" associations of lists. And remember that Dataset can store any form of hierarchical data, so we can't just force column-oriented semantics on the user and call it a day, as, say, R's data frames or pandas do.

But it should be possible to have the logical row-oriented model map onto a column-oriented physical implementation. And indeed I've spent a lot of time and effort on the type system that underlies Dataset to make it possible to do this. So for 10.0.1 or 10.0.2 I'm hoping we can have the best of both worlds: efficient packing into memory of whatever logical schema you happen to have.
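To make the row-oriented versus column-oriented distinction concrete, here is a minimal sketch. The toy data and the variable names (rows, cols, byRow, byColumn) are illustrative assumptions, not anything from Dataset's internals; only the keys "foo" and "bar" echo the post above.

```wolfram
(* Toy data, made up for illustration *)
rows = {<|"foo" -> 1, "bar" -> 2|>, <|"foo" -> 3, "bar" -> 4|>};

(* Row-oriented: each row is meaningful on its own, so named slots just work *)
byRow = Map[#foo + #bar &, rows]

(* Column-oriented: the same table as an association of lists;
   the operation has to be phrased per column instead of per row *)
cols = <|"foo" -> {1, 3}, "bar" -> {2, 4}|>;
byColumn = cols["foo"] + cols["bar"]

(* Both layouts give the same answer *)
byRow === byColumn
```

The row form repeats the keys in every row, which is where the memory overhead comes from; the column form stores each key once but gives up the per-row view.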

POSTED BY: Taliesin Beynon

Stefan Ragnarsson

Stefan Ragnarsson, Wolfram Research

Posted 11 years ago

Am I missing something, or is the test "compare" actually showing that the Association returned by AssociationThread is the memory-hungry culprit? I don't see Dataset taking up much more memory:

In[23]:= l1 = RandomInteger[1000000, {100000, 4}];
l2 = AssociationThread["Columns" <> ToString@# & /@ Range[4] -> #] & /@
    l1;
ds = Dataset[l1];

In[26]:= ByteCount /@ {l1, l2, ds}

Out[26]= {3200152, 67200080, 3200656}
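As a rough extension of the measurement above, the sketch below puts three layouts of the same data side by side: the packed array, a list of associations (row form), and an association of lists (column form). The variable names are assumptions chosen for illustration, and the exact byte counts will vary by version and platform, so no output is shown.

```wolfram
n = 100000; k = 4;
data = RandomInteger[1000000, {n, k}];
keys = "Columns" <> ToString[#] & /@ Range[k];

(* Row form: one association per row, so the keys are repeated n times *)
rowForm = AssociationThread[keys, #] & /@ data;

(* Column form: each key appears once, paired with a whole column *)
colForm = AssociationThread[keys, Transpose[data]];

(* Compare the sizes of the three representations *)
ByteCount /@ <|"packed" -> data, "rows" -> rowForm, "columns" -> colForm|>
```

If the row form is the space-eater, as the numbers above suggest, the column form should come out much closer to the packed array than to the list of associations.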

POSTED BY: Stefan Ragnarsson

Flip Phillips

Flip Phillips, Rochester Institute of Technology

Posted 11 years ago

Yes, the `Association`s are the space-eaters. I have a few eye-tracking datasets where I ended up with a 100x size increase, which, on top of a gigantic-to-begin-with dataset, is quite a problem. I had some hope that `Dataset` would do some kind of optimization of the memory used for storing all the redundant `Key`s... but alas, not in this iteration, it seems.

POSTED BY: Flip Phillips

Rodrigo Murta

Rodrigo Murta, Looqbox

Posted 11 years ago

Hi Stefan, the point is that there is currently no other way to represent tabular data using Dataset, but Taliesin and his team are already aware of it. I imagine they are working on something so that, when a Dataset is created from tabular data (for example from SQL, a CSV import, or a regular n x m list with headers), the data would be handled exactly as in the current design, but its memory use would be as efficient as a regular packed array.

POSTED BY: Rodrigo Murta

Szabolcs Horvát

Posted 11 years ago

A related question (I'm not sure if it's best to keep it here or to start a new thread): How to efficiently "transpose" such a structure in either direction? To clarify, I mean "transposition" between these two types of structures:

<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>

and

{<|"col1" -> 1, "col2" -> x|>, <|"col1" -> 2, "col2" -> y|>, <|"col1" -> 3, "col2" -> z|>}

These types of questions are already coming up e.g. here.

POSTED BY: Szabolcs Horvát

Stefan Ragnarsson

Stefan Ragnarsson, Wolfram Research

Posted 11 years ago

The AllowedHeads option in Transpose takes care of this:

Transpose[<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>, AllowedHeads -> All]

POSTED BY: Stefan Ragnarsson

Rodrigo Murta

Rodrigo Murta, Looqbox

Posted 11 years ago

@Stefan AllowedHeads is interesting. Nice undocumented option.

POSTED BY: Rodrigo Murta

Stefan Ragnarsson

Stefan Ragnarsson, Wolfram Research

Posted 11 years ago

It also works with Dimensions:

In[38]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>]

Out[38]= {2}

In[39]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>, AllowedHeads -> All]

Out[39]= {2, 2}

I believe the design for it wasn't quite ready for M10, which is why it's currently undocumented, but it's there and it works (as far as I know). Please note that like any undocumented symbol, this might change in a future version.
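Since AllowedHeads is undocumented and may change, here is a sketch of the same two-way "transposition" using only documented functions (Keys, Values, Transpose, AssociationThread, Merge). It assumes every row carries the same keys in the same order and that x, y, z have no values assigned; the names rows and back are illustrative.

```wolfram
cols = <|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>;

(* Columns -> rows: transpose the values, then re-attach the keys to each row *)
rows = AssociationThread[Keys[cols], #] & /@ Transpose[Values[cols]]

(* Rows -> columns: Merge collects the values found under each key into a list *)
back = Merge[rows, Identity]

(* Round trip recovers the original association of lists *)
back === cols
```

This makes no claim about speed relative to the AllowedHeads route; for large tables it would be worth timing both.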

POSTED BY: Stefan Ragnarsson

Rui Rojo

Posted 11 years ago

Check `Pivot`

POSTED BY: Rui Rojo

Szabolcs Horvát

Posted 11 years ago

Interesting find. Since it's undocumented, better mention its usage:

Pivot[<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>, 2]

POSTED BY: Szabolcs Horvát
