Datasets are very memory hungry (20x!)

Posted 10 years ago

For me, it's strange that the column-redundant implementation of Dataset does not allow a memory-efficient representation of simple tabular data (such as SQL data).

Since each column name is repeated in every row, it is not memory efficient.

See this test:

compare[lines_, columns_] := Module[{l1, l2},
    l1 = RandomInteger[1000000, {lines, columns}];  (* packed integer matrix *)
    l2 = AssociationThread["Columns" <> ToString@# & /@ Range[columns] -> #] & /@ l1;  (* one Association per row *)
    ByteCount[l2]/ByteCount@Dataset[l1] // N
]
compare[100000,4]

19.9959

It uses approximately 20x more memory!
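To see roughly where the overhead goes, here is a per-row breakdown (a quick sketch, not part of the measurement above; exact byte counts vary by version and platform, and ByteCount does not account for shared subexpressions):

(* one row as a plain list vs. as an Association with the same column names *)
row = RandomInteger[1000000, 4];
assocRow = AssociationThread["Columns" <> ToString@# & /@ Range[4] -> row];
ByteCount /@ {row, assocRow}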

I know that Dataset is a cool tool for hierarchical data structures. But I believe that 95% of data problems are still rectangular (all SQL query results are), and the current format of Dataset is very memory-inefficient for that.

Maybe Dataset could have an option TabularData -> True, so:

ds = Dataset[<|"col1" -> {1, 2, 3, 4}, "col2" -> {"a", "b", "c", "d"}|>, TabularData -> True]

Here we can check that this structure does not need more memory:

compare[lines_, columns_] := Module[{l1, l2},
    l1 = RandomInteger[1000000, {lines, columns}];  (* packed integer matrix *)
    l2 = Thread["Columns" <> ToString@# & /@ Range[columns] -> Transpose@l1];  (* column-oriented: one rule per column *)
    ByteCount[l2]/ByteCount@Dataset[l1] // N
]
compare[100000,4]

1.0007

Are there any Wolfram plans to handle tabular data better in Dataset?

POSTED BY: Rodrigo Murta
13 Replies
Posted 2 years ago

Hi, I'm wondering whether Mathematica 13 has an efficient implementation (for instance, if I create a Dataset from an array or packed array, does Mathematica keep the Association structure separate from the actual data?).

Thanks! Allan

POSTED BY: A Cooper

At the moment in Mathematica 13 the Dataset implementation is still less efficient than we'd like, but we are working on it.

Hi, I'm the designer and developer of Dataset.

Indeed, "row-oriented" lists of associations are a memory-inefficient way of storing tables.

On the other hand, they are the natural way to do it, conceptually, because each row is meaningful on its own, whereas a single column typically isn't. For example, you can Map pure functions that use #foo and #bar and things will just work. And flexible schemas are possible, whereas they are not with "column-oriented" associations of lists.
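Here is a small illustration of that (not from the original post, just showing how slot names work on associations):

rows = {<|"foo" -> 1, "bar" -> 10|>, <|"foo" -> 2, "bar" -> 20|>};
Map[#foo + #bar &, rows]
(* {11, 22} *)

The same function works unchanged inside a Dataset query, e.g. Dataset[rows][All, #foo + #bar &].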

And remember that Dataset can store any form of hierarchical data, so we can't just force column-oriented semantics on the user and call it a day, as, say, R's data frames or Pandas do.

But it should be possible to have the logical row-oriented model map onto a column-oriented physical implementation. And indeed I've spent a lot of time and effort on the type system that underlies Dataset to make this possible. So for 10.0.1 or 10.0.2 I'm hoping we can have the best of both worlds: efficient packing into memory of whatever logical schema you happen to have.
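To make the idea concrete, here is a toy sketch of that separation (purely illustrative, not the actual Dataset internals): the data lives once per column, and a row association is materialized only on demand.

(* hypothetical helper, for illustration only *)
columns = <|"a" -> Range[10], "b" -> Range[11, 20]|>;
rowAt[cols_Association, i_Integer] := AssociationThread[Keys[cols], Part[#, i] & /@ Values[cols]];
rowAt[columns, 3]
(* <|"a" -> 3, "b" -> 13|> *)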

POSTED BY: Taliesin Beynon

Am I missing something, or is the test "compare" actually showing that the Association returned by AssociationThread is the memory-hungry culprit? I don't see Dataset taking up much more memory:

In[23]:= l1 = RandomInteger[1000000, {100000, 4}];
l2 = AssociationThread["Columns" <> ToString@# & /@ Range[4] -> #] & /@
    l1;
ds = Dataset[l1];

In[26]:= ByteCount /@ {l1, l2, ds}

Out[26]= {3200152, 67200080, 3200656}

Yes, the Associations are the space-eaters. I have a few eye-tracking datasets where I ended up with a 100x size increase, which, on top of a gigantic-to-begin-with dataset, is quite a problem.

I had some hope that Dataset would do some kind of optimization of the memory used for storing all the redundant Keys... but alas not in this iteration it seems.

POSTED BY: Flip Phillips

Hi Stefan, the point is that there is no other way to represent tabular data using Dataset, but Taliesin and his team are already aware of it. I imagine they are working on something so that, when creating a Dataset from tabular data (e.g. from SQL, a CSV import, or a regular n×m list with headers), you would handle the data exactly as in the current design, but memory use would be as efficient as a regular packed array.
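For reference, packed arrays are what keep the plain numeric matrix so compact; a quick check (a sketch, assuming the l1 from the earlier test):

l1 = RandomInteger[1000000, {100000, 4}];
Developer`PackedArrayQ[l1]
(* True *)
ByteCount[l1]
(* about 3.2 MB, matching the measurement posted above *)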

POSTED BY: Rodrigo Murta

A related question (I'm not sure if it's best to keep it here or to start a new thread):

  • How to efficiently "transpose" such a structure in either direction?

To clarify, I mean "transposition" between these two types of structures:

<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>

and

{<|"col1" -> 1, "col2" -> x|>,
 <|"col1" -> 2, "col2" -> y|>,
 <|"col1" -> 3, "col2" -> z|>}

These types of questions are already coming up e.g. here.

POSTED BY: Szabolcs Horvát

The AllowedHeads option in Transpose takes care of this:

Transpose[<|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>, AllowedHeads -> All]
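For completeness, the same transposition can also be done with documented functions (a sketch that assumes every row has the same keys in the same order):

cols = <|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>;
rows = AssociationThread[Keys[cols], #] & /@ Transpose[Values[cols]];  (* column-oriented -> row-oriented *)
colsBack = AssociationThread[Keys[First[rows]], Transpose[Values[rows]]];  (* and back again *)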

@Stefan AllowedHeads is interesting. Nice undocumented option.

POSTED BY: Rodrigo Murta

It also works with Dimensions:

In[38]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>]

Out[38]= {2}

In[39]:= Dimensions[<|a -> {1, 2}, b -> {3, 4}|>, AllowedHeads -> All]

Out[39]= {2, 2}

I believe the design for it wasn't quite ready for M10, which is why it's currently undocumented, but it's there and it works (as far as I know). Please note that like any undocumented symbol, this might change in a future version.

Posted 10 years ago

Check Pivot

POSTED BY: Rui Rojo


Interesting find. Since it's undocumented, it's worth showing its usage:

Pivot[
 <|"col1" -> {1, 2, 3}, "col2" -> {x, y, z}|>,
 2
 ]
POSTED BY: Szabolcs Horvát