Datasets are very memory hungry (20x!)

Posted 11 years ago

I find the column-redundant implementation of Dataset strange: it does not allow a memory-efficient representation of simple tabular data (such as SQL data).

Because every column name is repeated in every row, it is not memory efficient.

See this test:

compare[lines_,columns_]:=Module[{l1,l2},
    l1=RandomInteger[1000000,{lines,columns}];
    l2=AssociationThread["Columns"<>ToString@#&/@Range[columns]-> #]&/@l1;
    ByteCount[l2]/ByteCount@Dataset[l1]//N
]
compare[100000,4]

19.9959

I get approximately 20x more memory for the row-wise Associations!

I know that Dataset is a cool tool for hierarchical data structures. But I believe that 95% of data problems are still rectangular (all SQL query results are), and the current format of Dataset is very memory inefficient for that.

Maybe Dataset could have an option TabularData -> True, so:

ds=Dataset[<|"col1" -> {1, 2, 3, 4}, "col2" -> {"a", "b", "c", "d"}|>, TabularData-> True]

Here we can check that this structure does not need more memory:

compare[lines_,columns_]:=Module[{l1,l2},
    l1=RandomInteger[1000000,{lines,columns}];
    l2=Thread["Columns"<>ToString@#&/@Range[columns]-> Transpose@l1];
    ByteCount[l2]/ByteCount@Dataset[l1]//N
]
compare[100000,4]

1.0007

Does Wolfram have any plans to handle tabular data better in Dataset?
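
The column-wise layout measured above can be sketched with a plain Association, without any new Dataset option. This is only an illustration: the variable name `cols` is an assumption, and the snippet relies on ordinary Association and Part behavior rather than anything specific to Dataset.

```wolfram
(* Column-wise table: each column name is stored once, next to its whole column *)
cols = <|"col1" -> {1, 2, 3, 4}, "col2" -> {"a", "b", "c", "d"}|>;

(* Take one column by name *)
cols["col2"]
(* {"a", "b", "c", "d"} *)

(* Take one row (here the second) across all columns *)
cols[[All, 2]]
(* <|"col1" -> 2, "col2" -> "b"|> *)

(* Number of rows, read from the first column *)
Length[First[cols]]
(* 4 *)
```

The trade-off is that a row has to be assembled on demand, while a whole column is available directly.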

POSTED BY: Rodrigo Murta
13 Replies
Posted 3 years ago
POSTED BY: A Cooper

At the moment in Mathematica 13 the Dataset implementation is still less efficient than we'd like, but we are working on it.

POSTED BY: Taliesin Beynon

Am I missing something, or is the test "compare" actually showing that the Association returned by AssociationThread is the memory-hungry culprit? I don't see Dataset taking up much more memory:

In[23]:= l1 = RandomInteger[1000000, {100000, 4}];
l2 = AssociationThread["Columns" <> ToString@# & /@ Range[4] -> #] & /@
    l1;
ds = Dataset[l1];

In[26]:= ByteCount /@ {l1, l2, ds}

Out[26]= {3200152, 67200080, 3200656}

Yes, the Associations are the space-eaters. I have a few eye-tracking datasets where I ended up with a 100x size increase, which, on top of a gigantic-to-begin-with dataset, is quite a problem.

I had some hope that Dataset would do some kind of optimization of the memory used for storing all the redundant Keys... but alas, not in this iteration, it seems.
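
One possible workaround for the redundant keys, sketched under assumptions: keep the values in a single packed array and store the column names once in a lookup Association. The names `names`, `data`, `colIndex`, `col` and `row` are hypothetical helpers for illustration and are not part of Dataset.

```wolfram
(* Column names, stored once (same naming scheme as the test above) *)
names = {"Columns1", "Columns2", "Columns3", "Columns4"};

(* Values as one packed integer matrix, one row per record *)
data = RandomInteger[1000000, {100000, 4}];

(* Map each column name to its position in the matrix *)
colIndex = AssociationThread[names -> Range[Length[names]]];

(* Hypothetical accessors: a column by name, a row as an Association built on demand *)
col[name_String] := data[[All, colIndex[name]]];
row[i_Integer] := AssociationThread[names -> data[[i]]];

(* Check that the matrix is still packed, then compare the total size with ByteCount[l2] above *)
Developer`PackedArrayQ[data]
ByteCount[data] + ByteCount[colIndex]

(* A single row still looks like a record with named fields *)
Keys[row[1]]
```

Only the rows that are actually requested get turned into Associations, so the keys are never duplicated across the whole table.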

POSTED BY: Flip Phillips

Hi Stefan, the point is that there is currently no other way to represent tabular data using Dataset, but Taliesin and his team are already aware of it. I imagine they are working on something so that, when a Dataset is created from tabular data (from SQL, a CSV import, or a regular n×m list with headers), the data would be handled exactly as in the current design, but the memory use would be as efficient as a regular packed array.

POSTED BY: Rodrigo Murta
POSTED BY: Szabolcs Horvát

The AllowedHeads option in Transpose takes care of this:

Transpose[<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>, AllowedHeads -> All]

@Stefan AllowedHeads is interesting. Nice undocumented option.

POSTED BY: Rodrigo Murta

It also works with Dimensions:

In[38]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>]

Out[38]= {2}

In[39]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>, AllowedHeads -> All]

Out[39]= {2, 2}

I believe the design for it wasn't quite ready for M10, which is why it's currently undocumented, but it's there and it works (as far as I know). Please note that like any undocumented symbol, this might change in a future version.
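
Because AllowedHeads is undocumented and might change, here is a minimal sketch that converts between a column-wise Association and a list of row Associations using only documented functions. The names `cols` and `rows` are assumptions for illustration, and the snippet makes no claim about what the AllowedHeads variant returns.

```wolfram
(* Column-wise data: one key per column *)
cols = <|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>;

(* Columns -> rows: transpose the values, then attach the keys to each row *)
rows = AssociationThread[Keys[cols], #] & /@ Transpose[Values[cols]]
(* {<|"col1" -> 1, "col2" -> x|>, <|"col1" -> 2, "col2" -> y|>, <|"col1" -> 3, "col2" -> z|>} *)

(* Rows -> columns: collect the values of each key into a list *)
Merge[rows, Identity]
(* <|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|> *)
```

The row form repeats every key in every row, so for large tables it is cheaper to keep the column form and convert only when a function requires rows.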

Posted 11 years ago
POSTED BY: Rui Rojo

Interesting find. Since it's undocumented, better mention its usage:

Pivot[
 <|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>,
 2
 ]
POSTED BY: Szabolcs Horvát