[?] What is the intended purpose of ByteArray & how can we use/convert it?

Posted 8 years ago
POSTED BY: Szabolcs Horvát
14 Replies
Posted 8 years ago
POSTED BY: Itai Seggev
POSTED BY: Szabolcs Horvát
Posted 8 years ago

The kernel used to use UCS-2 internally. It now uses a variant of UTF-8. MathLink also gained functions for sending and receiving UTF-8. These were steps 1 and 2 in the process of supporting the full Unicode character set. But there are several additional steps to actually get there. Important ones include creating equivalents of \:wxyz for non-BMP characters; getting the kernel, MathLink, and FE successfully talking to each other using these new methods; and finding all the places where the assumption that characters lie in the range 0-65535 is hardcoded, either implicitly or explicitly, and updating the code. We have made progress on some of these internally, but as you might imagine it's an ongoing process and we certainly can't promise the feature on any particular timeline.

POSTED BY: Itai Seggev

Thanks for the comments! It is encouraging that work is being done towards this goal.

POSTED BY: Szabolcs Horvát
Posted 8 years ago

Please do not forget about Unicode in file paths:

POSTED BY: Alexey Popkov

Hi @Itai Seggev and @Dorian Birraux,

There seem to be basically three different efficient representations of byte sequences: strings, byte arrays, and byte-type rank-1 RawArrays.

Only RawArrays can be exchanged with C code nicely. (Strings are supported in LibraryLink, but handling them is cumbersome, and they are assumed to be null-terminated.)

I found that a ByteArray can be converted to a RawArray:

In[4]:= ba = ByteArray[Range[10]];

In[5]:= RawArray["Byte", ba]
Out[5]= RawArray["UnsignedInteger8", "<" 10 ">"]

What about the reverse? How can I convert a rank-1 byte-type RawArray into a ByteArray without first unpacking it into a list of 64-bit machine integers (and blowing up the storage requirements 8-fold)?
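
For a rough sense of the size difference in question, one can compare ByteCount before and after unpacking. This is only an illustrative sketch: the array size is arbitrary, and the exact numbers depend on the version and on per-expression overhead.

```wolfram
ba = ByteArray[ConstantArray[0, 10^6]];  (* one million bytes of payload *)
list = Normal[ba];                        (* the same data as a list of machine integers *)

{ByteCount[ba], ByteCount[list]}          (* storage used by each representation *)
N[ByteCount[list]/ByteCount[ba]]          (* size ratio; roughly 8 if the list is stored as packed 64-bit integers *)
```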

POSTED BY: Szabolcs Horvát

You can't directly create a ByteArray from a RawArray. But, I see no reason not to support it, since we already have ByteArray from PackedArray.

POSTED BY: Dorian Birraux

Thanks for the response Dorian! In the meantime I also got a response on StackExchange, which pointed out that the type specification "ByteArray" can be used in LibraryFunctionLoad. In C code, it can be treated as a byte-type rank-1 RawArray. In Mathematica it will be a ByteArray. Thus one can write a simple library function that just returns a RawArray that was passed to it, but load it as LibraryFunctionLoad[..., {"RawArray"}, "ByteArray"].

I am looking forward to all this functionality becoming documented and brought to completion!

POSTED BY: Szabolcs Horvát

Thank you for the response Dorian.

... not all byte sequences are a valid String, so, when one stores bytes in a string, the data needs to be validated

Do you mean the reverse, i.e. that not all Strings are a valid byte sequence? This gives True:

tup = Tuples[Range[0, 255], {2}];
tup2 = ToCharacterCode /@ FromCharacterCode /@ tup;
tup2 === tup

All possible byte values, including 0, seem to be storable in Strings.

But either way, it is clear enough that a dedicated ByteArray is better for storing byte data than a string. That doesn't need to be explained further.


Other than the ones I mentioned, are there any operations we can perform on ByteArrays (especially other things than element extraction)?

Here are a few more suggestions, in addition to the ones I already mentioned:

Efficiently changing elements in-place through Part:

a = ByteArray[...];
a[[2]] = 5;

(This should also support Span, i.e. ;;)

Append, Prepend, AppendTo, PrependTo.

Something like Partition to break a big array into parts. My envisioned use case is processing a large ByteArray without unpacking the whole thing to an integer list. Instead, we could unpack small sections at a time, process them, then re-pack them. So perhaps other methods, such as Map, BlockMap, etc. are more appropriate (e.g., Audio has AudioBlockMap). If ByteCount can be trusted, each ByteArray has a storage overhead of 96 bytes, so completely pre-partitioning a large array into many small ones is perhaps not the best approach.
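
The block-wise idea can be approximated by hand with Part and Span. The sketch below assumes that Part with a Span works on ByteArray in your version; blockSize and the summing step are placeholders for whatever per-block processing is needed.

```wolfram
ba = ByteArray[Mod[Range[1000], 256]];   (* sample data, 1000 bytes *)
blockSize = 256;                          (* arbitrary block length, chosen for illustration *)

total = 0;
Do[
  (* unpack only the current block to an integer list *)
  block = Normal[ba[[start ;; Min[start + blockSize - 1, Length[ba]]]]];
  total += Total[block],
  {start, 1, Length[ba], blockSize}
];

(* sanity check against processing the whole array at once *)
total === Total[Normal[ba]]
```

Only one block is held as an integer list at any time, which is the point of the exercise. A built-in block-mapping function would avoid the explicit loop.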

Direct creation functions: Analogues of ConstantArray (large constant byte array) and RandomInteger (for random bytes).

But the most important missing functionality is conversion:

  • to/from strings (like FromCharacterCode, ToCharacterCode)
  • to/from files (BinaryReadList, BinaryWrite)
  • and very importantly: LibraryLink. Conversion to/from RawArray would suffice, as RawArrays work with LibraryLink since version 10.4. This would allow us to implement efficient functions for anything we need. I looked into the implementation of some of the built-in functions, and I see that currently sending to/from LibraryLink is done through an inefficient conversion to a 64-bit integer list (i.e. {Integer, 1} LibraryLink type).
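
Until direct conversions exist, the string and file cases can be handled by going through an intermediate integer list, which is exactly the inefficiency described above. A minimal sketch, in which the temporary file name is made up for illustration:

```wolfram
ba = ByteArray[{72, 101, 108, 108, 111}];

(* ByteArray -> String and back, via an integer list *)
str = FromCharacterCode[Normal[ba]];
ba2 = ByteArray[ToCharacterCode[str]];
Normal[ba2] === Normal[ba]

(* ByteArray -> file and back, again via an integer list *)
file = FileNameJoin[{$TemporaryDirectory, "bytearray-demo.bin"}];  (* hypothetical scratch file *)
BinaryWrite[file, Normal[ba]];   (* integers 0..255 are written as single bytes by default *)
Close[file];
fromFile = ByteArray[BinaryReadList[file, "Byte"]];
Normal[fromFile] === Normal[ba]
DeleteFile[file];
```

The string round trip is only safe here because every value is a plain character code below 256. For large data, the temporary integer list is what makes this approach costly.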
POSTED BY: Szabolcs Horvát

Do you mean the reverse, i.e. that not all Strings are a valid byte sequence? This gives True:

Let me rephrase what I wrote; I should have been clearer. You can indeed represent any Unicode code point from 0 to 65535 in a String. Internally though, characters are encoded as bytes, which requires defining a character encoding (i.e. a consistent way of representing values from 0 to 65535 using bytes). Given a character encoding, some byte sequences may be invalid.

E.g., in UTF-8, 192 is not a valid byte, and FromCharacterCode[192, "UTF-8"] returns an error.

It follows that when you build a String out of bytes, its content is encoded. That's not required with byte arrays. I hope this clarifies things.
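
To make the difference between character codes and encoded bytes concrete, here is a small sketch using only ToCharacterCode and FromCharacterCode. The character with code point 233 (é) is an arbitrary choice for illustration.

```wolfram
s = FromCharacterCode[233];                    (* one character, code point 233 *)

ToCharacterCode[s]                             (* character code: {233} *)
ToCharacterCode[s, "UTF-8"]                    (* encoded bytes: {195, 169} *)

(* decoding the two UTF-8 bytes gives back the single character *)
FromCharacterCode[{195, 169}, "UTF-8"] === s   (* True *)
```

The round trip in the earlier Tuples test works on character codes, not on encoded bytes, which is why it says nothing about whether a given byte sequence is valid in a particular encoding.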

POSTED BY: Dorian Birraux

ByteArray represents bytes, internally using one byte per value, which makes it space-efficient. Because the data are binary, the performance gain from using a ByteArray in place of a string of bytes is significant. Indeed, not all byte sequences are a valid String, so when one stores bytes in a string, the data needs to be validated. That is not required with ByteArray.

There is an effort to implement most, if not all, of the features you've listed as top-level functions. In the meantime, here are some undocumented functions that you may find interesting, as a complement to those you already mentioned:

  • Developer`EncodeBase64: ByteArray to String conversion; takes a byte array, returns a base64 string.
In[1]:= Developer`EncodeBase64[ByteArray[Range[5]]]
Out[1]= "AQIDBAU="
  • BinaryWrite accepts ByteArray, as an undocumented feature. I'm not aware of a reverse function that reads a file directly into a byte array.

Generally speaking, ByteArray is recommended everywhere one uses binary data. This brings me to BinarySerialize. BinarySerialize serializes any Wolfram Language expression to a binary representation that is platform independent and fast to deserialize. Unlike MX, which contains all the definitions of a given expression, such as DownValues, BinarySerialize is more data oriented, and represents roughly only the FullForm of an expression. The format used by BinarySerialize has simple enough specifications that we may consider publishing them.

As users noticed on StackOverflow, BinarySerialize does not always produce output smaller than the ByteCount of the input. One reason is that, unlike Compress, BinarySerialize does not automatically compress its output. You pay the cost of zlib compression only if need be. Also, some expressions such as arrays (packed, raw) are already stored efficiently in memory, so an output size roughly equal to the byte count is generally a good result (Range produces a packed array).
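
The size comparison can be tried directly. This sketch assumes a version in which BinarySerialize and BinaryDeserialize are available and in which PerformanceGoal -> "Size" is the option that turns on compression; the test data is arbitrary.

```wolfram
data = Range[1000];                                      (* a packed array of machine integers *)

bin = BinarySerialize[data];                             (* uncompressed; the result is a ByteArray *)
small = BinarySerialize[data, PerformanceGoal -> "Size"];  (* compressed variant *)

{ByteCount[data], Length[bin], Length[small]}            (* in-memory size vs. serialized sizes, in bytes *)

BinaryDeserialize[bin] === data                          (* the round trip recovers the original expression *)
```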

POSTED BY: Dorian Birraux

So this can be seen as a kind of PackedArray, but just for bytes? Semi-related, what about HDF5 import/export? It would be great to import directly into a ByteArray.

POSTED BY: Sander Huisman

It is possible to get the contents of an HDF5 file as a ByteArray; it is one of the import elements. But I do not know what sorts of conversions it goes through to get there.

POSTED BY: Szabolcs Horvát

How do I ask my int8 matrix to be converted to a ByteArray? I can't find an example...

POSTED BY: Sander Huisman