[?] What is the intended purpose of ByteArray & how can we use/convert it?

Posted 9 years ago

I was wondering what the intended purpose of the ByteArray type was. The cryptography functionality seems to be using it. And in version 11.1 we have BinarySerialize, which people are also a bit confused about (including myself, so consider that function included in this question as well).

The most straightforward guess about ByteArray is that it is a space-efficient and consistent way to store binary data. We could use a list of integers, but that is not space efficient (each integer takes at least 8 bytes) and the 0..255 range is not enforced.
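A minimal sketch of the storage comparison described above. The variable names are illustrative, and the exact ByteCount values depend on version and platform, so none are shown:

```wolfram
ints = Range[0, 255];        (* list of machine integers, 8 bytes per element when packed *)
bytes = ByteArray[ints];     (* same values, intended to use one byte per element *)

{ByteCount[ints], ByteCount[bytes]}   (* compare the two storage sizes *)

Normal[bytes] === ints       (* Normal converts a ByteArray back to a list of integers *)
```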

If such a space efficient data type is to be useful, it should be possible to convert/transfer it without the overhead of an inefficient integer-list intermediate representation.

How can we convert/transfer ByteArray to/from:

  • Files. Is there a function like BinaryReadList to handle it?

  • LibraryLink. Can I transfer a byte array efficiently to C? Can I convert it to a byte-type RawArray (which is already supported by LibraryLink)?

  • Strings. Strings are sometimes used to represent the contents of files, or binary data, in a byte-perfect way. We have ImportString/ExportString for this reason. Strings are not as good for this purpose as a real byte array because each character takes 2 bytes (and hopefully this will change in the future to allow for things beyond the basic multilingual plane in Unicode).

  • Base64-encoded data in strings. This is how ByteArrays show up in InputForm, though the documentation suggests that they are stored more efficiently internally. Such a string can be converted to a ByteArray using Developer`DecodeBase64ToByteArray. What about the reverse conversion?

It would also be nice to have an equivalent of StringToStream for ByteArrays.

If some of the above are not possible, please consider them a feature request. For reading from and writing to files, a lightweight function would be preferred (as opposed to the heavyweight, high-overhead Import/Export, which cannot even be used during initialization, i.e. in init.m).

What can we do with ByteArrays other than use them with the cryptography functions?

The documentation mentions that we can use Part, First, Last, Min, Max.

By experimentation, Take, Drop, Length, Dimensions, Rest, Most also work.

So do BitAnd, BitOr, etc.
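As a small illustration of the operations listed so far (a sketch only; the values are arbitrary, and the form in which BitAnd returns its result may vary between versions):

```wolfram
ba = ByteArray[{1, 2, 3, 250}];

{First[ba], Last[ba], Length[ba], Max[ba]}   (* element access and simple statistics *)

Take[ba, 2]                                  (* the first two bytes *)
Rest[ba]                                     (* everything except the first byte *)

BitAnd[ba, ByteArray[{15, 15, 15, 15}]]      (* element-wise bit operation on two byte arrays *)
```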

HTTPRequestData and related functions support the property "BodyByteArray"

Is there anything else?

POSTED BY: Szabolcs Horvát
14 Replies

Thanks for the response Dorian! In the meantime I also got a response on StackExchange, which pointed out that the type specification "ByteArray" can be used in LibraryFunctionLoad. In C code, it can be treated as a byte-type rank-1 RawArray. In Mathematica it will be a ByteArray. Thus one can write a simple library function that just returns a RawArray that was passed to it, but load it as LibraryFunctionLoad[..., {"RawArray"}, "ByteArray"].

I am looking forward to all this functionality becoming documented and brought to completion!

POSTED BY: Szabolcs Horvát

You can't directly create a ByteArray from a RawArray. But, I see no reason not to support it, since we already have ByteArray from PackedArray.

POSTED BY: Dorian Birraux

Hi @Itai Seggev and @Dorian Birraux,

There seem to be basically three different efficient representations of byte sequences: strings, byte arrays, and byte-type rank-1 RawArrays.

Only RawArrays can be exchanged with C code nicely. (Strings are supported in LibraryLink, but handling them is cumbersome, and they are assumed to be null-terminated.)

I found that a ByteArray can be converted to a RawArray:

In[4]:= ba = ByteArray[Range[10]];

In[5]:= RawArray["Byte", ba]
Out[5]= RawArray["UnsignedInteger8", "<" 10 ">"]

What about the reverse? How can I convert a rank-1 byte-type RawArray into a ByteArray without first unpacking it into a list of 64-bit machine integers (and blowing up the storage requirements 8-fold)?

POSTED BY: Szabolcs Horvát
Posted 8 years ago

Please do not forget about Unicode in file paths:

POSTED BY: Alexey Popkov

Thanks for the comments! It is encouraging that work is being done towards this goal.

POSTED BY: Szabolcs Horvát
Posted 9 years ago

The kernel used to use UCS-2 internally. It now uses a variant of UTF-8. MathLink also gained functions for sending and receiving UTF-8. These were steps 1 and 2 in the process of getting the full Unicode character set. But there are several additional steps to actually get there. Important ones include creating equivalents of \:wxyz for non-BMP characters; getting the kernel, MathLink, and FE successfully talking to each other using these new methods; and finding all the places where the assumption that characters lie in the range 0-65535 is hardcoded, either implicitly or explicitly, and updating the code. We have made progress on some of these internally, but as you might imagine it's an ongoing process and we certainly can't promise the feature on any particular timeline.

POSTED BY: Itai Seggev
POSTED BY: Szabolcs Horvát
Posted 9 years ago
POSTED BY: Itai Seggev

How do I ask my int8 matrix to be converted to a ByteArray? I can't find an example...

POSTED BY: Sander Huisman

It is possible to get the contents of an HDF5 file as a ByteArray; it is one of the import elements. But I do not know what sorts of conversions it goes through to get there.

POSTED BY: Szabolcs Horvát

So this can be seen as a kind-of PackedArray, but just for bytes? Semi-related, what about HDF5 import/export? It would be great to directly import towards a ByteArray.

POSTED BY: Sander Huisman

Do you mean the reverse, i.e. that not all Strings are a valid byte sequence? This gives True:

Let me rephrase what I wrote; I should have been clearer. You can indeed represent any Unicode character code point from 0 to 65535 in a String. Internally, though, characters are encoded as bytes, which requires defining a character encoding (i.e. a consistent way of representing values from 0 to 65535 using bytes). Given a character encoding, some byte sequences may be invalid.

E.g., in UTF-8, 192 is not a valid byte, and FromCharacterCode[192, "UTF-8"] returns an error.

It follows that when you build a String out of bytes, its content is encoded. That's not required with byte arrays. I hope this clarifies things.
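A short sketch of that difference. The byte values are chosen for illustration, and the exact message printed for the invalid case is not reproduced here:

```wolfram
(* A ByteArray accepts any value in 0..255; no character encoding is involved *)
raw = ByteArray[{192}];
Normal[raw]

(* Bytes 195, 169 (hex C3 A9) form a valid UTF-8 sequence: the character "é" *)
valid = FromCharacterCode[{195, 169}, "UTF-8"]

(* Byte 192 on its own is not valid UTF-8, so decoding has to validate and reject it *)
FromCharacterCode[192, "UTF-8"]
```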

POSTED BY: Dorian Birraux
POSTED BY: Szabolcs Horvát

ByteArray represents bytes, internally using one byte per value, which makes it space efficient. Because the data are binary, the performance gain from using a ByteArray in place of a string of bytes is significant. Indeed, not all byte sequences are a valid String, so when one stores bytes in a string, the data needs to be validated. That is not required with ByteArray.

There is an effort to implement most, if not all, of the features you've listed as top-level functions. In the meantime, here are some undocumented functions that you may find interesting, as a complement to those you already mentioned:

  • Developer`EncodeBase64: ByteArray to String conversion; takes a byte array, returns a base64 string.
In[1]:= Developer`EncodeBase64[ByteArray[Range[5]]]
Out[1]= "AQIDBAU="
  • BinaryWrite accepts ByteArray, as an undocumented feature. I'm not aware of a reverse function that reads a file directly into a byte array.
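The two items above can be combined into a round-trip sketch. It assumes the undocumented behaviour described in this post; the file name is hypothetical, and reading back still goes through BinaryReadList (an integer list), since no direct reader is known here:

```wolfram
ba = ByteArray[Range[5]];

(* Base64 round trip using the Developer` functions mentioned in this thread *)
b64 = Developer`EncodeBase64[ba];                     (* "AQIDBAU=" in the example above *)
decoded = Developer`DecodeBase64ToByteArray[b64];
Normal[decoded] === Range[5]

(* Write the bytes to a file; "bytearray-demo.bin" is a made-up name *)
file = FileNameJoin[{$TemporaryDirectory, "bytearray-demo.bin"}];
stream = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[stream, ba];
Close[stream];

(* Read back via a list of integers, then rebuild the ByteArray *)
readBack = ByteArray[BinaryReadList[file, "Byte"]];
Normal[readBack] === Range[5]
```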

Generally speaking, ByteArray is recommended wherever one uses binary data. This brings me to BinarySerialize. BinarySerialize serializes any Wolfram Language expression to a binary representation that is platform independent and fast to deserialize. Unlike MX, which contains all the definitions of a given expression, such as DownValues, BinarySerialize is more data oriented and roughly represents only the FullForm of an expression. The format used by BinarySerialize has simple enough specifications that we may consider publishing them.

As users noticed on StackOverflow, BinarySerialize does not always produce a smaller output with respect to ByteCount. One reason for that is that, unlike Compress, BinarySerialize does not automatically compress its output. You pay the cost of a zlib compression only if need be. Also, some expressions like arrays (packed, raw) are already efficiently stored in memory, so having an output size of roughly the size of the byte count is generally a good result (Range produces a packed array).
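A sketch of how one might compare these sizes. The PerformanceGoal -> "Size" option is an assumption about how compression is requested and may not be available in every version; no sizes are quoted because they vary:

```wolfram
data = Range[10000];                      (* a packed array, already compact in memory *)

plain = BinarySerialize[data];            (* no compression by default, per the post above *)
{ByteCount[data], Length[plain]}          (* expected to be of similar magnitude *)

BinaryDeserialize[plain] === data         (* round trip restores the expression *)

(* Assumption: this option asks BinarySerialize to compress its output *)
small = BinarySerialize[data, PerformanceGoal -> "Size"];
Length[small]

StringLength[Compress[data]]              (* Compress always compresses and returns a string *)
```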

POSTED BY: Dorian Birraux