[?] What is the intended purpose of ByteArray & how can we use/convert it?

Posted 8 years ago

I was wondering what the intended purpose of the ByteArray type was. The cryptography functionality seems to be using it. And in version 11.1 we have BinarySerialize, which people are also a bit confused about (including myself, so consider that function included in this question as well).

The most straightforward guess about ByteArray is that it is a space-efficient and consistent way to store binary data. We could use a list of integers, but that is not space efficient (each element takes at least 8 bytes) and the 0..255 range is not enforced.
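
To make the space argument concrete, here is a minimal sketch that compares the reported size of a packed integer list with that of a ByteArray holding the same values. The variable names are only for illustration, and the exact numbers depend on the version, so no output is shown.

```wolfram
(* 1000 byte values as a list of machine integers (a packed array) *)
values = RandomInteger[{0, 255}, 1000];

(* The same values stored as a ByteArray *)
bytes = ByteArray[values];

(* Compare the reported sizes: the list needs about 8 bytes per element,
   the ByteArray about 1 byte per element, each plus a fixed overhead *)
ByteCount[values]
ByteCount[bytes]
```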

If such a space efficient data type is to be useful, it should be possible to convert/transfer it without the overhead of an inefficient integer-list intermediate representation.

How can we convert/transfer ByteArray to/from:

  • Files. Is there a function like BinaryReadList to handle it?

  • LibraryLink. Can I transfer a byte array efficiently to C? Can I convert it to a byte-type RawArray (which is already supported by LibraryLink)?

  • Strings. Strings are sometimes used to represent the contents of files, or binary data, in a byte-perfect way. We have ImportString/ExportString for this reason. Strings are not as good for this purpose as a real byte array because each character takes 2 bytes (and hopefully this will change in the future to allow for things beyond the basic multilingual plane in Unicode).

  • Base64 encoded data in strings. This is how ByteArrays show up in InputForm, though the documentation suggests that they are stored more efficiently internally. Such a string can be converted to a ByteArray using Developer`DecodeBase64ToByteArray. What about the reverse conversion?

It would also be nice to have an equivalent of StringToStream for ByteArrays.

If some of the above are not possible, please consider them a feature request. Regarding reading/writing from/to files, a lightweight function would be preferred (as opposed to the heavyweight, high overhead Import/Export which cannot even be used during initialization, i.e. in init.m)

What can we do with ByteArrays other than use them with the cryptography functions?

The documentation mentions that we can use Part, First, Last, Min, Max.

By experimentation, Take, Drop, Length, Dimensions, Rest, Most also work.

So do BitAnd, BitOr, etc.

HTTPRequestData and related functions support the property "BodyByteArray"

Is there anything else?

POSTED BY: Szabolcs Horvát
14 Replies
Posted 8 years ago

As Dorian indicated, many of the requested features are in the works. Indeed, some of them were planned for the initial release of ByteArray, but were delayed for various reasons. We certainly hope to gradually add them over the course of the next few 11.x releases.

A bit more context and answers to questions (both explicit and implicit).

1) ByteArrays are not really arrays (at least in the sense of the Wolfram Language). They represent one-dimensional data of 8-bit unsigned integers. In that sense, ByteList or ByteRow might have been clearer. You can think of them as the closest thing WL has to a char *; from that POV the name ByteArray makes sense.

2) I think it would be better to think of them as being like SparseArray or StructuredArray, as opposed to a packed array. They are not transparent to user-level functions, but they overload many basic language constructs like Part, Take, etc. so that they appear like a 1-dimensional list.

3) One very big difference between SparseArray/StructuredArray and ByteArray is that ByteArray is opaque to Listable functions. This is intentional, because we don't want them to be accidentally converted to something else. Consider even something as simple as Plus: what does addition mean? Do individual entries overflow? Does it get converted to a normal list? Does it get converted to some future TwoByteArray? (We certainly won't have something by that name, but the idea is clear enough.)
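
Because arithmetic is not applied to a ByteArray directly, the choice of overflow behavior can be made explicit by going through Normal. The following is only a sketch of two possible conventions (wrap-around and saturation); neither is presented as what a built-in Plus would or should do.

```wolfram
ba = ByteArray[{250, 251, 252, 253, 254, 255}];

(* Convert to an ordinary list of integers before doing arithmetic *)
ints = Normal[ba];

(* Convention 1: wrap around modulo 256 *)
wrapped = ByteArray[Mod[ints + 10, 256]];

(* Convention 2: saturate at 255 *)
clipped = ByteArray[Clip[ints + 10, {0, 255}]];

Normal[wrapped]  (* {4, 5, 6, 7, 8, 9} *)
Normal[clipped]  (* {255, 255, 255, 255, 255, 255} *)
```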

4) Unlike a packed array, which has the same FullForm as the corresponding normal list, ByteArray uses Base64 so that it efficiently packs its values when you Put / Get it to files, not just in memory.

5) We internally encode strings in a variant of UTF-8. Now, of course, any byte can be faithfully converted to/from ISO8859-1, but that encoding only equals UTF-8 for the lower 7 bits. For other values, you need to use multiple bytes per character. So using a string to store byte data is both less space efficient and less time efficient (since you need to ensure the correct conversion between the two encodings).
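
The difference between the two encodings can be seen from the top level with ToCharacterCode. This small sketch only illustrates the encodings themselves; it says nothing about how the kernel lays out strings in memory.

```wolfram
(* A character in the 0..127 range: one byte in both encodings *)
ToCharacterCode["A", "ISO8859-1"]   (* {65} *)
ToCharacterCode["A", "UTF-8"]       (* {65} *)

(* Character code 233 (e with acute accent): one byte in ISO8859-1,
   two bytes in UTF-8 *)
s = FromCharacterCode[233];
ToCharacterCode[s, "ISO8859-1"]     (* {233} *)
ToCharacterCode[s, "UTF-8"]         (* {195, 169} *)
```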

POSTED BY: Itai Seggev

Thanks for the comments @Itai. This question is completely unrelated, but your mention of UTF-8 caught my eye. I thought that Mathematica used UCS-2, i.e. UTF-16 without surrogate pairs. This means that it is limited to the basic multilingual plane.

A long time ago (circa Mathematica 8), I wrote a little program that would allow me to enter components of Chinese characters and return larger characters that contained these parts. This made it easier for me to look up characters in dictionaries while learning Chinese. The problem was that the character breakdown database that I used relied on several characters that were not in the basic multilingual plane, and could not be encoded in UCS-2. Thus I had to decode and handle UTF-16 surrogate pairs manually ...

I vaguely remember @John Fultz saying that this situation came about because Mathematica adopted Unicode before the spec was finalized, and got stuck with supporting only ~65,000 characters. But that there were plans to remedy this and MathLink (?) was already gaining proper UTF-8 support. However, my memory may be incorrect, and I cannot find the original posts.

So to get to the point: You mentioned UTF-8 and I was wondering if Mathematica is about to get full Unicode support in the near future, including support for the supplementary planes. With all the text processing functions it has, this would be quite useful.

POSTED BY: Szabolcs Horvát
Posted 8 years ago

The kernel used to use UCS-2 internally. It now uses a variant of UTF-8. MathLink also gained functions for sending and receiving UTF-8. These were steps 1 and 2 in the process of getting the full Unicode character set. But there are several additional steps to actually get there. Important ones include creating equivalents of \:wxyz for non-BMP characters; getting the kernel, MathLink, and FE successfully talking to each other using these new methods; and finding all the places where the assumption that characters lie in the range 0-65535 is hardcoded, either implicitly or explicitly, and updating the code. We have made progress on some of these internally, but as you might imagine it's an ongoing process and we certainly can't promise the feature on any particular timeline.

POSTED BY: Itai Seggev

Thanks for the comments! It is encouraging that work is being done towards this goal.

POSTED BY: Szabolcs Horvát
Posted 7 years ago

Please do not forget about Unicode in file paths:

POSTED BY: Alexey Popkov

Hi @Itai Seggev and @Dorian Birraux,

There seem to be basically three different efficient representations of byte sequences: strings, byte arrays, and byte-type rank-1 RawArrays.

Only RawArrays can be exchanged with C code nicely. (Strings are supported in LibraryLink, but handling them is cumbersome, and they are assumed to be null-terminated.)

I found that a ByteArray can be converted to a RawArray:

In[4]:= ba = ByteArray[Range[10]];

In[5]:= RawArray["Byte", ba]
Out[5]= RawArray["UnsignedInteger8", "<" 10 ">"]

What about the reverse? How can I convert a rank-1 byte-type RawArray into a ByteArray without first unpacking it into a list of 64-bit machine integers (and blowing up the storage requirements 8-fold)?

POSTED BY: Szabolcs Horvát

You can't directly create a ByteArray from a RawArray. But, I see no reason not to support it, since we already have ByteArray from PackedArray.

POSTED BY: Dorian Birraux

Thanks for the response Dorian! In the meantime I also got a response on StackExchange, which pointed out that the type specification "ByteArray" can be used in LibraryFunctionLoad. In C code, it can be treated as a byte-type rank-1 RawArray. In Mathematica it will be a ByteArray. Thus one can write a simple library function that just returns a RawArray that was passed to it, but load it as LibraryFunctionLoad[..., {"RawArray"}, "ByteArray"].

I am looking forward to all this functionality becoming documented and brought to completion!

POSTED BY: Szabolcs Horvát

Thank you for the response Dorian.

... not all byte sequences are a valid String, so, when one stores bytes in a string, the data needs to be validated

Do you mean the reverse, i.e. that not all Strings are a valid byte sequence? This gives True:

tup = Tuples[Range[0, 255], {2}];
tup2 = ToCharacterCode /@ FromCharacterCode /@ tup;
tup2 === tup

All possible byte values, including 0, seem to be storable in Strings.

But either way, it is clear enough that a dedicated ByteArray is better for storing byte data than a string. That doesn't need to be explained further.


Other than the ones I mentioned, are there any operations we can perform on ByteArrays (especially other things than element extraction)?

Here are a few more suggestions, in addition to the ones I already mentioned:

Efficiently changing elements in-place through Part:

a = ByteArray[...];
a[[2]] = 5;

(This should also support Span, i.e. ;;)

Append, Prepend, AppendTo, PrependTo.

Something like Partition to break a big array into parts. My envisioned use case is processing a large ByteArray without unpacking the whole thing to an integer list. Instead, we could unpack small sections at a time, process them, then re-pack them. So perhaps other methods, such as Map, BlockMap, etc. are more appropriate. (E.g., Audio has AudioBlockMap). If ByteCount can be trusted, there is a storage overhead of 96 bytes, so perhaps complete pre-Partition-ing is not the best.
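
Until something like this is built in, block-wise processing can be approximated with Take and Normal, so that only one small section is expanded to an integer list at a time. This is a rough sketch: mapByteBlocks is a hypothetical helper name, not an existing function, and the block size is an arbitrary choice.

```wolfram
(* Hypothetical helper: apply f to successive blocks of at most n bytes.
   Each block is converted to an integer list only while it is processed. *)
mapByteBlocks[f_, bytes_ByteArray, n_Integer?Positive] :=
  Table[
    f[Normal[Take[bytes, {i, Min[i + n - 1, Length[bytes]]}]]],
    {i, 1, Length[bytes], n}
  ]

ba = ByteArray[RandomInteger[{0, 255}, 10^5]];

(* Example: sum of the byte values in each 4096-byte block *)
blockSums = mapByteBlocks[Total, ba, 4096];

(* The block sums add up to the sum over the whole array *)
Total[blockSums] === Total[Normal[ba]]
```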

Direct creation functions: Analogues of ConstantArray (large constant byte array) and RandomInteger (for random bytes).

But the most important missing functionality is conversion:

  • to/from strings (like FromCharacterCode, ToCharacterCode)
  • to/from files (BinaryReadList, BinaryWrite)
  • and very importantly: LibraryLink. Conversion to/from RawArray would suffice, as RawArrays work with LibraryLink since version 10.4. This would allow us to implement efficient functions for anything we need. I looked into the implementation of some of the built-in functions, and I see that currently sending to/from LibraryLink is done through an inefficient conversion to a 64-bit integer list (i.e. {Integer, 1} LibraryLink type).
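
For readers on a more recent release: to the best of my knowledge, the string and file conversions asked for here later appeared as ByteArrayToString, StringToByteArray and ReadByteArray, and BinaryWrite accepts a ByteArray. The sketch below assumes a version in which these functions exist; the file name is made up for illustration.

```wolfram
ba = ByteArray[{72, 101, 108, 108, 111}];

(* Strings: bytes are decoded/encoded as UTF-8 by default *)
str = ByteArrayToString[ba];                 (* "Hello" *)
Normal[StringToByteArray[str]] === Normal[ba]

(* Files: write the bytes, then read them back without an
   intermediate integer list *)
file = FileNameJoin[{$TemporaryDirectory, "bytearray-demo.bin"}];
BinaryWrite[file, ba];
Close[file];

back = ReadByteArray[file];
Normal[back] === Normal[ba]
```
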
POSTED BY: Szabolcs Horvát

Do you mean the reverse, i.e. that not all Strings are a valid byte sequence? This gives True:

Let me rephrase what I wrote; I should have been clearer. You can indeed represent any Unicode code point from 0 to 65535 in a String. Internally though, characters are encoded as bytes, which requires defining a character encoding (i.e. a consistent way of representing values from 0 to 65535 using bytes). Given a character encoding, some byte sequences may be invalid.

E.g., in UTF-8, 192 is not a valid byte and FromCharacterCode[192, "UTF-8"] returns an error.

It follows that when you build a String out of bytes, its content is encoded. That's not required with byte arrays. I hope this clarifies things.

POSTED BY: Dorian Birraux

ByteArray represents bytes, internally using one byte per value, which makes it space efficient. Because the data are binary, the performance gain of using a ByteArray in place of a string of bytes is significant. Indeed, not all byte sequences are a valid String, so, when one stores bytes in a string, the data needs to be validated. That is not required with ByteArray.

There is an effort to implement most if not all of the features you've listed as top-level functions. In the meantime, here are some undocumented functions that you may find interesting, as a complement to those you already mentioned:

  • Developer`EncodeBase64: ByteArray to String conversion; takes a byte array, returns a Base64 string.
In[1]:= Developer`EncodeBase64[ByteArray[Range[5]]]
Out[1]= "AQIDBAU="
  • BinaryWrite accepts ByteArray, as an undocumented feature. I'm not aware of a reverse function that reads a file directly into a byte array.
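
Combining this with Developer`DecodeBase64ToByteArray from the original question gives a round trip between ByteArray and Base64 strings. A minimal sketch, assuming both undocumented functions behave as described in this thread (being undocumented, they may change between versions):

```wolfram
ba = ByteArray[Range[5]];

(* ByteArray -> Base64 string *)
b64 = Developer`EncodeBase64[ba]

(* Base64 string -> ByteArray *)
back = Developer`DecodeBase64ToByteArray[b64];

(* Compare the byte values of the original and the round-tripped array *)
Normal[back] === Normal[ba]
```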

Generally speaking, ByteArray is recommended everywhere one uses binary data. It leads me to talk about BinarySerialize. BinarySerialize serializes any Wolfram Language expression to a binary representation that is platform independent and fast to deserialize. Contrary to MX which contains all the definitions of a given expression, like DownValues, BinarySerialize is more data oriented, and only somehow represents the FullForm of an expression. The format used by BinarySerialize has simple enough specifications that we may consider publishing them.

As users noticed on StackOverflow, BinarySerialize does not always produce a smaller output with respect to ByteCount. One reason for that is that, contrary to Compress, BinarySerialize does not automatically perform compression of the output. You pay the cost of a zlib compression only if need be. Also, some expressions like arrays (packed, raw) are already efficiently stored in memory, so having an output size of roughly the size of the byte count is generally a good result (Range produces a packed array).

POSTED BY: Dorian Birraux

So this can be seen as a kind-of PackedArray, but just for bytes? Semi-related, what about HDF5 import/export? It would be great to directly import towards a ByteArray.

POSTED BY: Sander Huisman

It is possible to get the contents of a HDF5 file as a ByteArray, it is one of the import elements. But I do not know what sorts of conversions it goes through to get there.

POSTED BY: Szabolcs Horvát

How do I ask my int8 matrix to be converted to a ByteArray? I can't find an example...

POSTED BY: Sander Huisman