
Efficient representation of data in RAM?

Posted 9 years ago

Hello,

I am new to Mathematica and had a look at how data is represented in memory.

The strange thing with matrices is that as soon as they contain NaN or missing values, their representation in memory seems to change much for the worse.

m1 = IdentityMatrix[100]*1.1;
m2 = m1;
m2[[3, 4]] = NaN;
ByteCount[m1]
ByteCount[m2]

I see that the normal matrix occupies about 8 bytes per element, which is expected. But as soon as I put NaN (or Missing) into it, the consumption rises to about 25 bytes per element. Am I missing something, or does Mathematica not handle NaN and missing values as efficiently as normal real numbers?

Another thing I found confusing is how Dataset works.

d1 = Dataset[
   Array[Function[x, <|"a" -> 3.1, "b" -> 4.1|>], 1000000]];
ByteCount[d1]
d2 = Dataset[<|"a" -> Array[Function[x, 3.1], 1000000], 
    "b" -> Array[Function[x, 3.1], 1000000]|>];
ByteCount[d2]

So a dataset with two columns and a million rows occupies about 215 bytes per element, which is a lot. A dataset with two rows and a million columns occupies about 8 bytes per element, which is OK. But the second one actually looks like a dataset with two rows and one column in which each cell is a list of 1000000 elements, which can easily be seen in a more compact example:

Dataset[<|"a" -> {1, 2}, "b" -> {3, 4}|>]

Another thing which I do not understand is whether it is possible to restrict the type of elements in columns of Dataset. For example, would it be possible to say that a column "a" must have only Real numbers in it, column "b" must have only text and column "c" must only have items from a specific set, such as { Apple, Pear, Kiwi }.

Please could somebody shed some light upon this?

8 Replies

To add: you can't restrict 'columns' of datasets to have a certain type. The same holds throughout Mathematica: every list or variable can contain any type of expression (string, real, integer, image, whatever).
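Although a column type cannot be declared, the rows can at least be checked against a pattern before they are wrapped in a Dataset. This is a minimal sketch, not an enforced constraint. It assumes `KeyValuePattern`, which only exists in versions newer than the one discussed here, and `rows`, `rowPattern` and `validRowsQ` are hypothetical names made up for the illustration:

```mathematica
(* Hypothetical sample rows; keys and values are illustrative only *)
rows = {
   <|"a" -> 3.1, "b" -> "first", "c" -> "Apple"|>,
   <|"a" -> 4.1, "b" -> "second", "c" -> "Kiwi"|>
   };

(* Each row needs a Real "a", a String "b", and "c" from a fixed set *)
rowPattern = KeyValuePattern[{
    "a" -> _Real,
    "b" -> _String,
    "c" -> ("Apple" | "Pear" | "Kiwi")
    }];

(* True only if every row matches the pattern *)
validRowsQ[data_List] := AllTrue[data, MatchQ[rowPattern]]

validRowsQ[rows]    (* True *)

(* Integer in "a" and an unknown fruit in "c": the check fails *)
validRowsQ[Append[rows, <|"a" -> 1, "b" -> "third", "c" -> "Plum"|>]]    (* False *)
```

The check has to be rerun whenever the data changes; nothing stops a later assignment from inserting a value of another type.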

I agree with Szabolcs that there is no efficient NaN equivalent. But I always found the usage of NaNs (which many people do in e.g. Matlab) to mean 'missing' a very bad habit. Any code should be NaN-less in my opinion.

POSTED BY: Sander Huisman

Also note that:

var = Array[Function[x, 3.1], 1000000];
Developer`PackedArrayQ[var]

var = ConstantArray[3.1, 1000000];
Developer`PackedArrayQ[var]

Array does not (in general) return a packed array. You should use ConstantArray in such a case, which does return a packed array.
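If a list of machine reals has already been built unpacked, it can be packed afterwards with `Developer`ToPackedArray`. A short sketch; whether the `Array` result starts out packed may depend on the version, so only the final state is annotated:

```mathematica
var = Array[Function[x, 3.1], 1000000];
Developer`PackedArrayQ[var]       (* result may depend on the version *)

(* Packing succeeds because every element is a machine real *)
packed = Developer`ToPackedArray[var];
Developer`PackedArrayQ[packed]    (* True *)

(* Compare the memory footprint of the two representations *)
{ByteCount[var], ByteCount[packed]}
```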

POSTED BY: Sander Huisman

I did not intend to use NaN as a replacement for missing items, but NaNs can occasionally happen as a result of a computation. And just like with NaN, putting Missing into a matrix (and probably a list too) immediately makes it occupy a lot of space. That is unexpected to me, as standard floating-point numbers support NaN and special values like infinity, and there is a way to represent missing items too. I hoped that the same could be accessed and used in Mathematica without any overhead.

There is no NaN or equivalent that can be stored in packed arrays. Indeterminate, Infinity, DirectedInfinity, and Missing are your options.

I'm not sure why you would have these 'as a result of a computation'; in all the computations I do (which is quite a bit) I never 'needed' NaNs. I think it is just bad practice in general. Divisions by zero (the main cause) should be caught and dealt with accordingly.

Edit: There is a 'NaN' but you need to load a (built-in) package:

Needs["ComputerArithmetic`"]
Developer`PackedArrayQ[Developer`ToPackedArray[{1, 2, 3, NaN}]]

But, as you can see, it can't be used inside a PackedArray. So then one can perhaps better use Indeterminate, Infinity and so on...

POSTED BY: Sander Huisman

Well... You need NaN precisely because of things like division by zero. And this is how you "catch" issues: you just notice that the result of the computation is NaN. This is how standard floating point computations work. I suppose Indeterminate is represented as some kind of NaN when it is in a PackedArray although the same is not done for Matrix.

That is how it is used, but one should prevent it in my opinion. I never make use of NaNs (Indeterminate), and they are really not necessary. If you want to use packed arrays, you'd better prevent them before doing e.g. a division, as they can't be packed.

Developer`ToPackedArray will return the original input, not a packed version, if one tries to pack a list containing Indeterminate or similar.
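A small sketch of that behaviour (`mixed` and `result` are just illustrative names):

```mathematica
mixed = {1., 2., Indeterminate};
result = Developer`ToPackedArray[mixed];

Developer`PackedArrayQ[result]   (* False: Indeterminate is a symbol *)
result === mixed                 (* True: the input comes back unchanged *)
```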

POSTED BY: Sander Huisman

> This is how standard floating point computations work.

In Mathematica floating point calculations do not work like they do in C or with IEEE floating point values. This is due to Mathematica's roots as a computer algebra system. There are many differences, including the lack of separation between +0 and -0, the fact that there isn't only NaN, +inf, -inf, but a more general DirectedInfinity (try 1/0), etc. Overflow or underflow doesn't result in infinity or zero. Instead "machine numbers" get converted to "arbitrary precision numbers" which can represent much larger or smaller values with any number of digits. The calculation can continue, however, it will be slower.
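Two of these differences can be seen directly. A minimal sketch, with `big` and `bigger` as illustrative names:

```mathematica
big = $MaxMachineNumber;
MachineNumberQ[big]          (* True *)

(* Going past the machine range does not produce Infinity *)
bigger = 10.*big;
MachineNumberQ[bigger]       (* False: now an arbitrary precision number *)

(* Exact division by zero gives a directionless infinity;
   Quiet suppresses the accompanying warning message *)
FullForm[Quiet[1/0]]         (* DirectedInfinity[] *)
```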

> I suppose Indeterminate is represented as some kind of NaN when it is in a PackedArray although the same is not done for Matrix.

There are multiple misunderstandings here. There is no such thing as Matrix in Mathematica, only nested Lists. If the nested lists have the same structure as an n-dimensional array, then this data structure may become a "packed array". This simply means that it uses the most efficient possible internal representation.

Packed arrays cannot contain Indeterminate. They can only contain machine numbers. Indeterminate is a symbol. Inserting a symbol unpacks the array.

They also cannot contain arbitrary precision numbers. If such a result appears, it unpacks the array.
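Both kinds of unpacking can be observed with `Developer`PackedArrayQ`. A small sketch (the variable names are illustrative):

```mathematica
v = ConstantArray[1.5, 1000];
Developer`PackedArrayQ[v]    (* True *)

w = v;
w[[3]] = Indeterminate;      (* a symbol: forces unpacking *)
Developer`PackedArrayQ[w]    (* False *)

u = v;
u[[3]] = N[Pi, 30];          (* an arbitrary precision number: also unpacks *)
Developer`PackedArrayQ[u]    (* False *)

(* The unpacked copies take noticeably more memory *)
{ByteCount[v], ByteCount[w], ByteCount[u]}
```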


Overall I agree with you that there are some cases when the ability to store either missing values, indeterminates, or infinities in packed arrays would be nice. R can handle missing values (NA) without trading off storage efficiency, and this is frequently used by statisticians. But in Mathematica currently it is not possible.

Will this ever be added in the future? Given what I know about how Mathematica works, I highly doubt it. It is my impression that adding this would be a huge and highly risky task, which would also generate a lot of extra work for the future (more special cases will need to be handled by each new function that is added to the language).

POSTED BY: Szabolcs Horvát

Take a look here:

Rectangular arrays containing all machine integers or machine reals can be represented efficiently. These are called packed arrays.

Anything else is a general Mathematica expression. https://reference.wolfram.com/language/tutorial/EverythingIsAnExpression.html

Packed arrays are transparent to the user. They "look" the same as any other array, except they are more efficient.

A packed array cannot contain a symbol. NaN is just an (undefined) symbol. It has nothing to do with floating point NaN.

Is there a way to represent a floating point NaN in Mathematica? I don't think so. Mathematica's arithmetic works differently than IEEE floating point math. Indeterminate comes close, and interacts properly with arithmetic. But it is still a symbol and can't be stored in packed arrays. Packed arrays do not support such values. If you add such a value to a packed array, it gets unpacked behind the scenes.

Missing data is typically represented with Missing[], since version 10. There are a few functions which handle or will return Missing[], so using it is a bit more convenient than using your own symbol.
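For example, one way to keep numerical work efficient is to strip the Missing[] entries before computing. A minimal sketch with made-up data (`data` and `clean` are illustrative names):

```mathematica
data = {1.2, Missing[], 3.4, Missing["NotAvailable"]};

(* Drop every element with head Missing *)
clean = DeleteMissing[data]      (* {1.2, 3.4} *)
Mean[clean]                      (* 2.3 *)

(* With the Missing entries gone, the remaining reals can be packed *)
Developer`PackedArrayQ[Developer`ToPackedArray[clean]]   (* True *)
```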

POSTED BY: Szabolcs Horvát