I find it strange that Dataset stores column names redundantly, which prevents a memory-efficient representation of simple tabular data (such as SQL query results).
Because every row carries its own copy of each column name, the representation is not memory efficient.
See this test:
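compare[lines_, columns_] := Module[{l1, l2},
  (* l1: plain integer matrix *)
  l1 = RandomInteger[1000000, {lines, columns}];
  (* l2: the same rows as associations, each repeating every column name *)
  l2 = AssociationThread[("Columns" <> ToString[#] & /@ Range[columns]) -> #] & /@ l1;
  (* size of the row-wise associations relative to a Dataset of the plain matrix *)
  ByteCount[l2]/ByteCount@Dataset[l1] // N
]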
compare[100000,4]
19.9959
That is approximately 20 times more memory for the row-wise associations than for a Dataset built from the plain matrix!
I know that Dataset is a great tool for hierarchical data structures. But I believe that 95% of data problems are still rectangular (all SQL query results are), and the current format of Dataset is very memory inefficient for them.
Maybe Dataset could have an option, say TabularData -> True, for this case:
ds=Dataset[<|"col1" -> {1, 2, 3, 4}, "col2" -> {"a", "b", "c", "d"}|>, TabularData-> True]
Here we can check that a column-oriented structure like this does not need more memory:
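compare[lines_, columns_] := Module[{l1, l2},
  (* l1: plain integer matrix *)
  l1 = RandomInteger[1000000, {lines, columns}];
  (* l2: one rule per column, so each column name is stored only once *)
  l2 = Thread[("Columns" <> ToString[#] & /@ Range[columns]) -> Transpose@l1];
  (* size of the column-wise rules relative to a Dataset of the plain matrix *)
  ByteCount[l2]/ByteCount@Dataset[l1] // N
]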
compare[100000,4]
1.0007
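To make the two layouts concrete, here is a minimal sketch on the four-row table from the example above. The helper names toRows and toColumns are hypothetical (they are not part of Dataset), and the sketch only illustrates converting between the column-oriented and row-oriented forms, not how Dataset works internally:

```mathematica
(* Column-oriented: each column name is stored once, next to its whole column *)
cols = <|"col1" -> {1, 2, 3, 4}, "col2" -> {"a", "b", "c", "d"}|>;

(* Row-oriented: every row is an association that repeats the column names *)
rows = {
   <|"col1" -> 1, "col2" -> "a"|>,
   <|"col1" -> 2, "col2" -> "b"|>,
   <|"col1" -> 3, "col2" -> "c"|>,
   <|"col1" -> 4, "col2" -> "d"|>};

(* Hypothetical helper: columns -> list of row associations *)
toRows[c_Association] := AssociationThread[Keys[c], #] & /@ Transpose[Values[c]];

(* Hypothetical helper: list of row associations -> columns *)
toColumns[r : {__Association}] := Merge[r, Identity];

toRows[cols] === rows      (* True *)
toColumns[rows] === cols   (* True *)
```

With helpers like these, the data could stay in the compact cols form and be expanded with toRows only for the rows that are actually needed.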
Does Wolfram have any plans to handle tabular data better in Dataset?