Datasets are very memory hungry (20x!)

To me, the column-redundant implementation of Dataset is strange: it does not allow a memory-efficient representation of a simple tabular data format (like SQL data).

Since each column name is repeated in every row, it is not memory efficient.

See this test:

compare[lines_, columns_] := Module[{l1, l2},
  l1 = RandomInteger[1000000, {lines, columns}];
  (* one Association per row, repeating the column names in every row *)
  l2 = AssociationThread[("Columns" <> ToString@# & /@ Range[columns]) -> #] & /@ l1;
  ByteCount[l2]/ByteCount@Dataset[l1] // N
  ]
compare[100000, 4]

19.9959

It uses approximately 20x more memory!
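The per-row overhead is visible on a single row (a minimal sketch; exact byte counts vary by version and platform, so I don't show numbers):

(* a row stored as an Association carries all four key strings *)
ByteCount[<|"Columns1" -> 1, "Columns2" -> 2, "Columns3" -> 3, "Columns4" -> 4|>]
(* the same values without the repeated keys *)
ByteCount[{1, 2, 3, 4}]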

I know that Dataset is a great tool for hierarchical data structures. But I believe that 95% of data problems are still rectangular (all SQL query results are), and the current format of Dataset is very memory-inefficient for those.

Maybe Dataset could have an option TabularData -> True, so:

ds = Dataset[<|"col1" -> {1, 2, 3, 4}, "col2" -> {"a", "b", "c", "d"}|>, TabularData -> True]

Here we can check that this structure does not need more memory:

compare[lines_, columns_] := Module[{l1, l2},
  l1 = RandomInteger[1000000, {lines, columns}];
  (* column-oriented: each column name is stored only once *)
  l2 = Thread[("Columns" <> ToString@# & /@ Range[columns]) -> Transpose@l1];
  ByteCount[l2]/ByteCount@Dataset[l1] // N
  ]
compare[100000, 4]

1.0007

Does Wolfram have any plans to better handle tabular data in Dataset?

POSTED BY: Rodrigo Murta
Answer
4 months ago

A related question (I'm not sure if it's best to keep it here or to start a new thread):

  • How to efficiently "transpose" such a structure in either direction?

To clarify, I mean "transposition" between these two types of structures:

<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>

and

{<|"col1" -> 1, "col2" -> x|>,
<|"col1" -> 2, "col2" -> y|>,
<|"col1" -> 3, "col2" -> z|>}

These types of questions are already coming up e.g. [here](http://mathematica.stackexchange.com/q/54490/12).

POSTED BY: Szabolcs Horvat
Answer
4 months ago

The AllowedHeads option in Transpose takes care of this:

Transpose[<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>, AllowedHeads -> All]
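Presumably the same option covers the reverse direction too (an untested sketch; AllowedHeads is undocumented, so verify on your version):

Transpose[{<|"col1" -> 1, "col2" -> x|>, <|"col1" -> 2, "col2" -> y|>}, AllowedHeads -> All]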
POSTED BY: Stefan Ragnarsson
Answer
4 months ago

@Stefan AllowedHeads is interesting. Nice undocumented option.

POSTED BY: Rodrigo Murta
Answer
4 months ago

It also works with Dimensions:

In[38]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>]

Out[38]= {2}

In[39]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>, AllowedHeads -> All]

Out[39]= {2, 2}

I believe the design for it wasn't quite ready for M10, which is why it's currently undocumented, but it's there and it works (as far as I know). Please note that like any undocumented symbol, this might change in a future version.

POSTED BY: Stefan Ragnarsson
Answer
4 months ago

Check Pivot

POSTED BY: Rui Rojo
Answer
3 months ago


Interesting find. Since it's undocumented, it's best to show its usage:

Pivot[
<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>,
2
]
POSTED BY: Szabolcs Horvat
Answer
3 months ago

Am I missing something, or is the test "compare" actually showing that the Association returned by AssociationThread is the memory-hungry culprit? I don't see Dataset taking up much more memory:

In[23]:= l1 = RandomInteger[1000000, {100000, 4}];
l2 = AssociationThread["Columns" <> ToString@# & /@ Range[4] -> #] & /@ l1;
ds = Dataset[l1];

In[26]:= ByteCount /@ {l1, l2, ds}

Out[26]= {3200152, 67200080, 3200656}
POSTED BY: Stefan Ragnarsson
Answer
4 months ago

Yes, the Associations are the space-eaters. I have a few eye-tracking datasets where I ended up with a 100x size increase, which, on top of a gigantic-to-begin-with dataset, is quite a problem.

I had some hope that Dataset would do some kind of optimization of the memory used for storing all the redundant keys... but alas, not in this iteration, it seems.

POSTED BY: Flip Phillips
Answer
4 months ago

Hi Stefan, the point is that there is no other way to represent tabular data using Dataset, but Taliesin and his team are already aware of it. I imagine they are working on something so that, when creating a Dataset from tabular data (e.g. from SQL, a CSV import, or a regular n-by-m list with headers), the data would be handled exactly as in the current design, but memory would be used as efficiently as with a regular packed array.

POSTED BY: Rodrigo Murta
Answer
3 months ago

Hi, I'm the designer and developer of Dataset.

Indeed, "row-oriented" lists of associations are a memory-inefficient way of storing tables.

On the other hand, they are the natural way to do it, conceptually, because each row is meaningful on its own, whereas a single column typically isn't. For example, you can Map pure functions that use #foo and #bar and things will just work. And flexible schemas are possible, whereas they are not with "column-oriented" associations of lists.
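As a quick illustration of that row-oriented convenience (a minimal sketch with made-up column names):

(* each row is a self-describing Association, so slot names just work *)
rows = {<|"foo" -> 1, "bar" -> 10|>, <|"foo" -> 2, "bar" -> 20|>};
Dataset[rows][All, #foo + #bar &]  (* a Dataset of {11, 22} *)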

And remember that Dataset can store any form of hierarchical data, so we can't just force column-oriented semantics on the user and call it a day, as, say, R's data frames or Pandas do.

But it should be possible to map the logical row-oriented model onto a column-oriented physical implementation. And indeed I've spent a lot of time and effort on the type system that underlies Dataset to make this possible. So for 10.0.1 or 10.0.2 I'm hoping we can have the best of both worlds: efficient packing into memory of whatever logical schema you happen to have.
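The idea of a column-oriented physical layout behind a row-oriented logical view can be sketched in user code (hypothetical helper names; this is not the actual implementation):

(* store each column exactly once, as a packed array *)
cols = <|"foo" -> Developer`ToPackedArray[RandomInteger[100, 10^5]],
         "bar" -> Developer`ToPackedArray[RandomInteger[100, 10^5]]|>;

(* materialize a single logical row on demand *)
row[i_] := AssociationMap[cols[#][[i]] &, Keys[cols]]

row[1]  (* e.g. <|"foo" -> ..., "bar" -> ...|> *)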

POSTED BY: Taliesin Beynon
Answer
4 months ago