WOLFRAM COMMUNITY

GROUPS:

Data Science Wolfram Language

Datasets are very memory hungry (20x!)

Rodrigo Murta, Looqbox

Posted 11 years ago

I find it strange that Dataset stores column names redundantly, which prevents a memory-efficient representation of simple tabular data (such as SQL results). Because each column name is repeated in every row, it is not memory efficient. See this test:

compare[lines_, columns_] := Module[{l1, l2},
  l1 = RandomInteger[1000000, {lines, columns}];
  l2 = AssociationThread["Columns" <> ToString@# & /@ Range[columns] -> #] & /@ l1;
  ByteCount[l2]/ByteCount@Dataset[l1] // N
]

compare[100000, 4]

19.9959

That is approximately 20x more memory! I know that Dataset is a cool tool for hierarchical data structures, but I believe that 95% of data problems are still rectangular (all SQL query results are), and the current Dataset format is very memory inefficient for them. Maybe Dataset could have an option TabularData -> True, so:

ds = Dataset[<|"col1" -> {1, 2, 3, 4}, "col2" -> {"a", "b", "c", "d"}|>, TabularData -> True]

Here we can check that this structure does not need more memory:

compare[lines_, columns_] := Module[{l1, l2},
  l1 = RandomInteger[1000000, {lines, columns}];
  l2 = Thread["Columns" <> ToString@# & /@ Range[columns] -> Transpose@l1];
  ByteCount[l2]/ByteCount@Dataset[l1] // N
]

compare[100000, 4]

1.0007

Does Wolfram have plans to handle tabular data better in Dataset?

POSTED BY: Rodrigo Murta

13 Replies

A Cooper

Posted 3 years ago

POSTED BY: A Cooper

Stefan Ragnarsson, Wolfram Research

Posted 3 years ago

At the moment in Mathematica 13 the `Dataset` implementation is still less efficient than we'd like, but we are working on it.

POSTED BY: Stefan Ragnarsson

Taliesin Beynon, Wolfram Research

Posted 11 years ago

POSTED BY: Taliesin Beynon

Stefan Ragnarsson, Wolfram Research

Posted 11 years ago

Am I missing something, or is the test "compare" actually showing that the Association returned by AssociationThread is the memory-hungry culprit? I don't see Dataset taking up much more memory:

In[23]:= l1 = RandomInteger[1000000, {100000, 4}];
l2 = AssociationThread["Columns" <> ToString@# & /@ Range[4] -> #] & /@
    l1;
ds = Dataset[l1];

In[26]:= ByteCount /@ {l1, l2, ds}

Out[26]= {3200152, 67200080, 3200656}

POSTED BY: Stefan Ragnarsson

Flip Phillips, Rochester Institute of Technology

Posted 11 years ago

Yes, the `Association`s are the space-eaters. I have a few eye-tracking datasets where I ended up with a 100x size increase, which, on top of a dataset that was gigantic to begin with, is quite a problem. I had some hope that `Dataset` would do some kind of optimization of the memory used for storing all the redundant `Key`s, but alas, not in this iteration, it seems.

POSTED BY: Flip Phillips

Rodrigo Murta, Looqbox

Posted 11 years ago

Hi Stefan, the point is that there is currently no other way to represent tabular data using Dataset, but Taliesin and his team are already aware of it. I imagine they are working on something so that, when a Dataset is created from tabular data (from SQL, a CSV import, or a regular n×m list with headers), the data is handled exactly as in the current design, but memory use is as efficient as a regular packed array.
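
As a rough illustration of the column-oriented layout described above (a sketch only, not how Dataset is or will be implemented), one can keep a single list per column name and build a row association only when it is needed. The names `names`, `columnStore`, and `rowAt` are hypothetical and chosen here purely for illustration:

```wolfram
(* Sample rectangular data, same shape as in the compare test *)
data = RandomInteger[1000000, {100000, 4}];
names = "Columns" <> ToString[#] & /@ Range[4];

(* Column-oriented storage: each column name is stored once, next to its whole column *)
columnStore = AssociationThread[names -> Transpose[data]];

(* Hypothetical helper: rebuild row i as an Association on demand *)
rowAt[store_Association, i_Integer] := #[[i]] & /@ store

(* Compare the size of this layout with the raw array *)
ByteCount /@ {data, columnStore}

(* Extract the first row with its column names *)
rowAt[columnStore, 1]
```

The trade-off in this sketch is that the keys are stored once per column rather than once per row, at the cost of assembling each row when it is accessed.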

POSTED BY: Rodrigo Murta

Szabolcs Horvát

Posted 11 years ago

POSTED BY: Szabolcs Horvát

Stefan Ragnarsson, Wolfram Research

Posted 11 years ago

The AllowedHeads option in Transpose takes care of this:

Transpose[<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>, AllowedHeads -> All]

POSTED BY: Stefan Ragnarsson

Rodrigo Murta, Looqbox

Posted 11 years ago

@Stefan AllowedHeads is interesting. Nice undocumented option.

POSTED BY: Rodrigo Murta

Stefan Ragnarsson, Wolfram Research

Posted 11 years ago

It also works with Dimensions:

In[38]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>]

Out[38]= {2}

In[39]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>, AllowedHeads -> All]

Out[39]= {2, 2}

I believe the design for it wasn't quite ready for M10, which is why it's currently undocumented, but it's there and it works (as far as I know). Please note that like any undocumented symbol, this might change in a future version.

POSTED BY: Stefan Ragnarsson

Rui Rojo

Posted 11 years ago

POSTED BY: Rui Rojo

Szabolcs Horvát

Posted 11 years ago

Interesting find. Since it's undocumented, better mention its usage:

Pivot[<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>, 2]

POSTED BY: Szabolcs Horvát
