Efficient representation of data in RAM?

Posted 9 years ago

Hello,

I am new to Mathematica and had a look at how data is represented in memory.

The strange thing with matrices is that as soon as they contain NaN or missing values, their representation in memory seems to change for the worse.

m1 = IdentityMatrix[100]*1.1;
m2 = m1;
m2[[3, 4]] = NaN;
ByteCount[m1]
ByteCount[m2]

I see that the normal matrix occupies about 8 bytes per element, which is expected. But as soon as I put NaN (or Missing) into it, the consumption rises to about 25 bytes per element. Am I missing something, or does Mathematica not handle NaN and missing values as efficiently as normal real numbers?

Another thing I found confusing is how Dataset works.

d1 = Dataset[
   Array[Function[x, <|"a" -> 3.1, "b" -> 4.1|>], 1000000]];
ByteCount[d1]
d2 = Dataset[<|"a" -> Array[Function[x, 3.1], 1000000], 
    "b" -> Array[Function[x, 3.1], 1000000]|>];
ByteCount[d2]

So a dataset with two columns and a million rows occupies about 215 bytes per element, which is a lot. A dataset with two rows and a million columns occupies about 8 bytes per element, which is OK. But it actually looks like a dataset with two rows and one column, where each cell is a list of 1000000 elements, which can easily be seen in a more compact example:

Dataset[<|"a" -> {1, 2}, "b" -> {3, 4}|>]
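For comparison, a column-oriented association like the one above can be turned into a list of row associations, which Dataset displays as a table with one row per record. This is only a sketch; the names `cols` and `rows` are introduced here for illustration and do not come from the post.

```mathematica
(* Column-oriented data: one key per column, each value a list *)
cols = <|"a" -> {1, 2}, "b" -> {3, 4}|>;

(* Build one association per row by pairing the keys with each row of values *)
rows = AssociationThread[Keys[cols], #] & /@ Transpose[Values[cols]];
(* rows is {<|"a" -> 1, "b" -> 3|>, <|"a" -> 2, "b" -> 4|>} *)

Dataset[rows]
```

The row-oriented form is the one that costs more memory per element in the measurement above, so which layout to prefer depends on whether display or storage matters more.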

Another thing I do not understand is whether it is possible to restrict the type of elements in the columns of a Dataset. For example, would it be possible to say that column "a" must contain only Real numbers, column "b" must contain only text, and column "c" must contain only items from a specific set, such as { Apple, Pear, Kiwi }?

Please could somebody shed some light upon this?

8 Replies
POSTED BY: Szabolcs Horvát

Take a look here:

Rectangular arrays containing all machine integers or machine reals can be represented efficiently. These are called packed arrays.

Anything else is a general Mathematica expression. https://reference.wolfram.com/language/tutorial/EverythingIsAnExpression.html

Packed arrays are transparent to the user. They "look" the same as any other array, except they are more efficient.

A packed array cannot contain a symbol. NaN is just an (undefined) symbol. It has nothing to do with floating point NaN.

Is there a way to represent a floating point NaN in Mathematica? I don't think so. Mathematica's arithmetic works differently than IEEE floating point math. Indeterminate comes close, and interacts properly with arithmetic. But it is still a symbol and can't be stored in packed arrays. Packed arrays do not support such values. If you add such a value to a packed array, it gets unpacked behind the scenes.
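The unpacking described above can be observed directly with `Developer`PackedArrayQ`. The following is a minimal sketch that reuses the matrix from the question but stores `Indeterminate` instead of the undefined symbol `NaN`; the exact `ByteCount` values will depend on the version and platform, so none are shown.

```mathematica
m1 = IdentityMatrix[100]*1.1;        (* packed array of machine reals *)
m2 = m1;
m2[[3, 4]] = Indeterminate;          (* a symbol, so m2 can no longer stay packed *)

Developer`PackedArrayQ /@ {m1, m2}   (* {True, False} *)
ByteCount /@ {m1, m2}                (* the unpacked copy is noticeably larger *)
```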

Missing data is typically represented with Missing[], since version 10. There are a few functions which handle or will return Missing[], so using it is a bit more convenient than using your own symbol.
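As a small illustration of the built-in support for `Missing[]`, the sketch below flags and removes missing entries before computing a mean. The list `data` is made up for this example; note that `MissingQ` requires a slightly newer release than `Missing[]` itself.

```mathematica
data = {1.5, Missing[], 2.5};

MissingQ /@ data              (* {False, True, False} *)
clean = DeleteMissing[data]   (* {1.5, 2.5} *)
Mean[clean]                   (* 2. *)
```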

POSTED BY: Szabolcs Horvát

That is how it is used, but one should avoid it in my opinion. I never make use of NaNs (Indeterminate), and they are really not necessary. If you want to use packed arrays, you had better rule them out before doing e.g. a division, as they can't be packed.

Developer`ToPackedArray will return the original input, not a packed version, if one tries to pack an array containing Indeterminate or a similar symbol.

POSTED BY: Sander Huisman

Well... You need NaN precisely because of things like division by zero. And this is how you "catch" issues: you just notice that the result of the computation is NaN. This is how standard floating point computations work. I suppose Indeterminate is represented as some kind of NaN when it is in a PackedArray although the same is not done for Matrix.

There is no NaN or equivalent that can be stored in packed arrays. Indeterminate, Infinity, DirectedInfinity, and Missing are your options.

I'm not sure why you would have these 'as a result of a computation'; in all the computations I do (which is quite a lot) I never 'needed' NaNs. I think it is just bad practice in general. Divisions by zero (the main cause) should be caught and dealt with accordingly.
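One possible way to deal with zero denominators up front, sketched here under the assumption that simply dropping the affected positions is acceptable, is to build a mask with `Unitize` and divide only the remaining entries. The variable names are made up for this example.

```mathematica
num = {1., 2., 3.};
den = {2., 0., 4.};

(* Unitize gives 1 for non-zero entries and 0 for zeros *)
mask = Unitize[den];   (* {1, 0, 1} *)

(* Keep only the positions with a non-zero denominator, then divide *)
result = Pick[num, mask, 1]/Pick[den, mask, 1]   (* {0.5, 0.75} *)

(* Check whether the result is still a packed array *)
Developer`PackedArrayQ[result]
```

Since no symbol such as Indeterminate is ever produced, nothing in this computation forces the arrays to be unpacked.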

Edit: There is a 'NaN' but you need to load a (built-in) package:

Needs["ComputerArithmetic`"]
Developer`PackedArrayQ[Developer`ToPackedArray[{1, 2, 3, NaN}]]

But, as you can see, it can't be used inside a PackedArray. So it is probably better to use Indeterminate, Infinity and so on...

POSTED BY: Sander Huisman

I did not intend to use NaN as a replacement for missing items, but NaNs can occasionally happen as a result of a computation. And just like with NaN, putting Missing into a matrix (and probably a list too) immediately makes it occupy a lot of space. That is unexpected to me, as standard floating-point numbers have support for NaN and special values like infinity, and there is a way to represent missing items too. I hoped that the same could be accessed and used in Mathematica without any overhead.

Also note that:

var = Array[Function[x, 3.1], 1000000];
Developer`PackedArrayQ[var]

var = ConstantArray[3.1, 1000000];
Developer`PackedArrayQ[var]

Array does not (in general) return a packed array. You should use ConstantArray in such a case, which does return a packed array.

POSTED BY: Sander Huisman

To add: you can't restrict 'columns' of datasets to have a certain type. The same holds throughout Mathematica: every list or variable can contain any type of value (string, real, integer, image, whatever).

I agree with Szabolcs that there is no efficient NaN equivalent. But I have always found using NaNs to mean 'missing' (which many people do in e.g. Matlab) a very bad habit. Any code should be NaN-less in my opinion.
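Although types cannot be enforced, rows can be checked before they go into a Dataset. The sketch below is one way to do that; `validRowQ` and `allowedFruit` are hypothetical names rather than built-ins, and the allowed items are written as strings here instead of the symbols used in the question.

```mathematica
allowedFruit = {"Apple", "Pear", "Kiwi"};

(* validRowQ is a hypothetical helper, not a built-in *)
validRowQ[row_Association] :=
  MatchQ[row["a"], _Real] && StringQ[row["b"]] && MemberQ[allowedFruit, row["c"]];
validRowQ[_] := False;

rows = {
   <|"a" -> 3.1, "b" -> "first", "c" -> "Apple"|>,
   <|"a" -> 4, "b" -> "second", "c" -> "Kiwi"|>   (* "a" is an Integer here *)
   };

validRowQ /@ rows          (* {True, False} *)
AllTrue[rows, validRowQ]   (* False *)
```

This only validates data at the moment it is checked; later modifications to the Dataset are not constrained by it.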

POSTED BY: Sander Huisman