Hello,
I am new to Mathematica and had a look at how data is represented in memory.
The strange thing with matrices is that as soon as they contain NaN or missing values, their representation in memory seems to change for the worse.
m1 = IdentityMatrix[100]*1.1;
m2 = m1;
m2[[3, 4]] = NaN;
ByteCount[m1]
ByteCount[m2]
I see that the normal matrix occupies about 8 bytes per element, which is expected. But as soon as I put NaN (or Missing) into it, the consumption rises to about 25 bytes per element. Am I missing something, or does Mathematica not handle NaN and missing values as efficiently as normal real numbers?
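To narrow down whether this is a storage issue rather than a property of the values themselves, here is a check I can think of. It assumes that Developer`PackedArrayQ reports whether an array uses the compact packed storage, and that m1 and m2 are defined as above:

```mathematica
(* Assumes m1 and m2 are defined as in the example above *)
Developer`PackedArrayQ[m1]  (* is the all-real matrix stored as a packed array? *)
Developer`PackedArrayQ[m2]  (* is it still packed after the symbol NaN is inserted? *)

(* approximate bytes per element for each matrix *)
N[ByteCount[m1]/(Times @@ Dimensions[m1])]
N[ByteCount[m2]/(Times @@ Dimensions[m2])]
```

If the second check returns False, the extra memory would presumably come from the matrix no longer being packed, though I have not confirmed that this is the whole explanation.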
Another thing I found confusing is how Dataset works.
d1 = Dataset[
Array[Function[x, <|"a" -> 3.1, "b" -> 4.1|>], 1000000]];
ByteCount[d1]
d2 = Dataset[<|"a" -> Array[Function[x, 3.1], 1000000],
"b" -> Array[Function[x, 3.1], 1000000]|>];
ByteCount[d2]
So a dataset with two columns and a million rows occupies about 215 bytes per element, which is a lot. A dataset with two rows and a million columns occupies about 8 bytes per element, which is OK. But the second one actually looks like a dataset with two rows and one column, where each cell is a list of 1000000 elements. This can easily be seen in a more compact example:
Dataset[<|"a" -> {1, 2}, "b" -> {3, 4}|>]
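To make the difference in layout concrete, here is a small sketch comparing the two constructions. The names colWise and rowWise are only illustrative, and Normal is used to expose the data underlying each Dataset:

```mathematica
(* association of lists: one entry per key, each holding a whole list *)
colWise = Dataset[<|"a" -> {1, 2}, "b" -> {3, 4}|>];

(* list of associations: one association per row *)
rowWise = Dataset[{<|"a" -> 1, "b" -> 3|>, <|"a" -> 2, "b" -> 4|>}];

Normal[colWise]  (* underlying data of the first form *)
Normal[rowWise]  (* underlying data of the second form *)
```

The same four numbers are stored in both cases, but only the second form has one record per row.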
Another thing I do not understand is whether it is possible to restrict the type of elements in the columns of a Dataset. For example, would it be possible to say that column "a" must contain only Real numbers, column "b" must contain only text, and column "c" must contain only items from a specific set, such as { Apple, Pear, Kiwi }?
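To make the question concrete, this is roughly the kind of constraint I have in mind, written as a manual check. The sample rows, the allowed set and the helper validRowQ are all made up for illustration. This is not a built-in Dataset feature, only a sketch of the validation I would otherwise have to do by hand:

```mathematica
(* Hypothetical sample data; column names follow the example above *)
rows = {<|"a" -> 3.1, "b" -> "some text", "c" -> "Apple"|>,
        <|"a" -> 4.1, "b" -> "more text", "c" -> "Kiwi"|>};

(* Allowed values for column "c" *)
allowed = {"Apple", "Pear", "Kiwi"};

(* validRowQ is a made-up helper: "a" must be a Real, "b" a string,
   and "c" a member of the allowed set *)
validRowQ[r_Association] :=
  MatchQ[r["a"], _Real] && StringQ[r["b"]] && MemberQ[allowed, r["c"]];

(* True only if every row satisfies all three constraints *)
AllTrue[rows, validRowQ]
```

What I am asking is whether Dataset can enforce something like this itself, so that such a check does not have to be run separately.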
Could somebody please shed some light on this?