WOLFRAM COMMUNITY

[?] What is the intended purpose of ByteArray & how can we use/convert it?

Szabolcs Horvát

Posted 9 years ago

I was wondering what the intended purpose of the `ByteArray` type is. The cryptography functionality seems to use it. And in version 11.1 we have `BinarySerialize`, which people are also a bit confused about (myself included, so consider that function part of this question as well).

The most straightforward guess about `ByteArray` is that it is a space-efficient and consistent way to store binary data. We could use a list of integers, but that is not space efficient (each integer takes at least 8 bytes) and the `0..255` range is not enforced.

If such a space-efficient data type is to be useful, it should be possible to convert/transfer it without the overhead of an inefficient integer-list intermediate representation. How can we convert/transfer `ByteArray` to/from:

- Files. Is there a function like `BinaryReadList` to handle it?
- LibraryLink. Can I transfer a byte array efficiently to C? Can I convert it to a byte-type RawArray (which is already supported by LibraryLink)?
- Strings. Strings are sometimes used to represent the contents of files, or binary data, in a byte-perfect way. We have `ImportString`/`ExportString` for this reason. Strings are not as good for this purpose as a real byte array because each character takes 2 bytes (and hopefully this will change in the future to allow for things beyond the basic multilingual plane in Unicode).
- Base64-encoded data in strings. This is how ByteArrays show up in `InputForm`, though the documentation suggests that they are stored more efficiently internally. Such a string can be converted to a `ByteArray` using Developer`DecodeBase64ToByteArray. What about the reverse conversion?

It would also be nice to have an equivalent of `StringToStream` for `ByteArray`s.

If some of the above are not possible, please consider them a feature request. For reading/writing from/to files, a lightweight function would be preferred (as opposed to the heavyweight, high-overhead `Import`/`Export`, which cannot even be used during initialization, i.e. in `init.m`).

What can we do with `ByteArray`s other than use them with the cryptography functions?

- The documentation mentions that we can use `Part`, `First`, `Last`, `Min`, `Max`.
- By experimentation, `Take`, `Drop`, `Length`, `Dimensions`, `Rest`, `Most` also work. So do `BitAnd`, `BitOr`, etc.
- `HTTPRequestData` and related functions support the property `"BodyByteArray"`.

Is there anything else?
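
For illustration, here is a minimal sketch of the operations listed above. It assumes a kernel in which they behave as described in this post (11.1 or later). The sample values are arbitrary, and `Normal` is used only to view the contents as an ordinary integer list.

```mathematica
(* Build a byte array from integers in the range 0..255 (arbitrary sample values) *)
ba = ByteArray[{1, 2, 3, 250, 255}];

Length[ba]          (* number of bytes: 5 *)
ba[[4]]             (* Part returns an ordinary integer: 250 *)
{Min[ba], Max[ba]}  (* smallest and largest byte value *)
Take[ba, 2]         (* the first two bytes *)
Normal[ba]          (* unpack to a plain list: {1, 2, 3, 250, 255} *)

(* Storage comparison between a byte array and a list of machine integers.
   The exact numbers depend on version and platform, so none are quoted here. *)
ints = Range[0, 255];
{ByteCount[ByteArray[ints]], ByteCount[ints]}
```

The last line is the quickest way to check the space-efficiency guess on a given installation.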

POSTED BY: Szabolcs Horvát

Szabolcs Horvát

Posted 8 years ago

Thanks for the response Dorian! In the meantime I also got a response on StackExchange, which pointed out that the type specification `"ByteArray"` can be used in `LibraryFunctionLoad`. In C code, it can be treated as a byte-type rank-1 RawArray. In Mathematica it will be a `ByteArray`. Thus one can write a simple library function that just returns a RawArray that was passed to it, but load it as `LibraryFunctionLoad[..., {"RawArray"}, "ByteArray"]`. I am looking forward to all this functionality becoming documented and brought to completion!

POSTED BY: Szabolcs Horvát

Dorian Birraux

Dorian Birraux, WOLFRAM

Posted 8 years ago

You can't directly create a ByteArray from a RawArray. But, I see no reason not to support it, since we already have ByteArray from PackedArray.

POSTED BY: Dorian Birraux

Szabolcs Horvát

Posted 8 years ago

Hi @Itai Seggev and @Dorian Birraux,

There seem to be basically three different efficient representations of byte sequences: strings, byte arrays, and byte-type rank-1 RawArrays. Only RawArrays can be exchanged with C code nicely. (Strings are supported in LibraryLink, but handling them is cumbersome, and they are assumed to be null-terminated.)

I found that a ByteArray can be converted to a RawArray:

    In[4]:= ba = ByteArray[Range[10]];
    In[5]:= RawArray["Byte", ba]
    Out[5]= RawArray["UnsignedInteger8", "<" 10 ">"]

What about the reverse? How can I convert a rank-1 byte-type RawArray into a ByteArray without unpacking it first into a list of 64-bit machine integers (and blowing up the storage requirements 8-fold)?

POSTED BY: Szabolcs Horvát

Alexey Popkov

Posted 9 years ago

POSTED BY: Alexey Popkov

Szabolcs Horvát

Posted 9 years ago

Thanks for the comments! It is encouraging that work is being done towards this goal.

POSTED BY: Szabolcs Horvát

Itai Seggev

Itai Seggev, WOLFRAM

Posted 9 years ago

The kernel used to use UCS-2 internally. It now uses a variant of UTF-8. MathLink also gained functions for sending / receiving UTF-8. These were steps 1 and 2 in the process of getting the full Unicode character set. But there are several additional steps to actually get there. Important ones include creating equivalents of \:wxyz for non-BMP characters; getting the kernel, MathLink, and FE successfully talking to each other using these new methods; and finding all the places where the assumption that characters lie in the range 0-65535 is hardcoded, either implicitly or explicitly, and updating the code. We have made progress on some of these internally, but as you might imagine it's an ongoing process and we certainly can't promise the feature on any particular timeline.

POSTED BY: Itai Seggev

Szabolcs Horvát

Posted 9 years ago

Thanks for the comments @Itai. This question is completely unrelated, but your mention of UTF-8 caught my eye. I thought that Mathematica used UCS-2, i.e. UTF-16 without surrogate pairs. This means that it is limited to the basic multilingual plane.

A long time ago (circa Mathematica 8), I wrote a little program that would allow me to enter components of Chinese characters and return larger characters that contained these parts. This made it easier for me to look up characters in dictionaries while learning Chinese. The problem was that the character breakdown database that I used relied on several characters that were not in the basic multilingual plane, and could not be encoded in UCS-2. Thus I had to decode and handle UTF-16 surrogate pairs manually ...

I vaguely remember @John Fultz saying that this situation came about because Mathematica adopted Unicode before the spec was finalized, and got stuck with supporting only ~65,000 characters, but that there were plans to remedy this and MathLink (?) was already gaining proper UTF-8 support. However, my memory may be incorrect, and I cannot find the original posts.

So to get to the point: you mentioned UTF-8, and I was wondering if Mathematica is about to get full Unicode support in the near future, including support for the supplementary planes. With all the text processing functions it has, this would be quite useful.

POSTED BY: Szabolcs Horvát

Itai Seggev

Itai Seggev, WOLFRAM

Posted 9 years ago

As Dorian indicated, many of the requested features are in the works. Indeed, some of them were planned for the initial release of ByteArray, but were delayed for various reasons. We certainly hope to gradually add them over the course of the next few 11.x releases.

A bit more context and answers to questions (both explicit and implicit).

1) ByteArrays are not really arrays (at least in the sense of the Wolfram Language). They represent one-dimensional data of 8-bit unsigned integers. In that sense, ByteList or ByteRow might have been clearer. You can think of them as the closest thing WL has to a char *; from that POV the name ByteArray makes sense.

2) I think it would be better to think of them as being like SparseArray or StructuredArray, as opposed to a packed array. They are not transparent to user-level functions, but they overload many basic language constructs like Part, Take, etc. so that they appear like a 1-dimensional list.

3) One very big difference between SparseArray/StructuredArray and ByteArray is that ByteArray is opaque to Listable functions. This is intentional, because we don't want them to be accidentally converted to something else. Consider even something as simple as Plus: what does addition mean? Do individual entries overflow? Does it get converted to a normal list? Does it get converted to some future TwoByteArray? (We certainly won't have something by that name, but the idea is clear enough.)

4) Unlike PackedArray, which has a FullForm like its normal list, ByteArray uses Base64 so that it efficiently packs its values when you Put / Get it to files, not just in memory.

5) We internally encode strings in a variant of UTF-8. Now, of course, any byte can be faithfully converted to/from ISO8859-1, but that encoding only equals UTF-8 for the lower 7 bits. For other values, you need to use multiple bytes per character. So using a string to store byte data is less efficient in both space and time (since you need to ensure the correct conversion between the two encodings).
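
Points 3 and 4 above can be sketched as follows. This is only an illustration under stated assumptions: the wrap-around rule (modulo 256) is one arbitrary answer to the "what does addition mean?" question, not a behavior of `ByteArray` itself, and the sample bytes are made up.

```mathematica
ba = ByteArray[{10, 200, 255}];   (* arbitrary sample bytes *)

(* Point 3: since arithmetic is not applied element-wise automatically,
   unpack, apply an explicitly chosen overflow rule, and re-pack.
   Wrapping modulo 256 is an assumption made for this example. *)
wrapped = ByteArray[Mod[Normal[ba] + 1, 256]];
Normal[wrapped]                   (* {11, 201, 0} *)

(* Point 4: the textual form of a ByteArray; per the post it carries
   the data as a Base64 string rather than as a list of integers *)
ToString[ba, InputForm]
```

Making the overflow rule explicit in user code is the flip side of the design choice described in point 3: nothing is converted or truncated silently.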

POSTED BY: Itai Seggev

Sander Huisman

Sander Huisman, University of Twente

Posted 9 years ago

How do I ask my int8 matrix to be converted to a ByteArray? I can't find an example...

POSTED BY: Sander Huisman

Szabolcs Horvát

Posted 9 years ago

It is possible to get the contents of an HDF5 file as a ByteArray; it is one of the import elements. But I do not know what sorts of conversions it goes through to get there.

POSTED BY: Szabolcs Horvát

Sander Huisman

Sander Huisman, University of Twente

Posted 9 years ago

So this can be seen as a kind of PackedArray, but just for bytes? Semi-related: what about HDF5 import/export? It would be great to import directly into a ByteArray.

POSTED BY: Sander Huisman

Dorian Birraux

Dorian Birraux, WOLFRAM

Posted 9 years ago

"Do you mean the reverse, i.e. that not all Strings are a valid byte sequence? This gives True:"

Let me rephrase what I wrote; I should have been clearer. You can indeed represent any Unicode character code point from 0 to 65535 in a `String`. Internally, though, characters are encoded as bytes, which requires defining a character encoding (i.e. a consistent way of representing values from 0 to 65535 using bytes). Given a character encoding, some byte sequences may be invalid. For example, in UTF-8 `192` is not a valid byte and `FromCharacterCode[192, "UTF-8"]` returns an error. It follows that when you build a `String` out of bytes, its content is encoded. That's not required with byte arrays. I hope this clarifies it.
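
The distinction between character codes and encoded bytes can be tried out directly. The sketch below is illustrative only; the byte values are arbitrary, and the exact message printed for the invalid case is not quoted because it may differ between versions.

```mathematica
bytes = Range[0, 255];

(* Treating each value as a character code round-trips every byte *)
ToCharacterCode[FromCharacterCode[bytes]] === bytes

(* Decoding bytes as UTF-8 is a different operation: 192 on its own
   is not a valid sequence (per the post, this returns an error) *)
FromCharacterCode[{192}, "UTF-8"]

(* A valid two-byte UTF-8 sequence decodes to a single character *)
ToCharacterCode[FromCharacterCode[{195, 136}, "UTF-8"]]   (* {200} *)
```

The last line shows the cost mentioned in the thread: the single value 200 occupies two bytes once it is UTF-8 encoded.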

POSTED BY: Dorian Birraux

Szabolcs Horvát

Posted 9 years ago

Thank you for the response, Dorian.

"... not all byte sequences are a valid `String`, so, when one stores bytes in a string, the data needs to be validated"

Do you mean the reverse, i.e. that not all `String`s are a valid byte sequence? This gives `True`:

    tup = Tuples[Range[0, 255], {2}];
    tup2 = ToCharacterCode /@ FromCharacterCode /@ tup;
    tup2 === tup

All possible byte values, including `0`, seem to be storable in `String`s. But either way, it is clear enough that a dedicated `ByteArray` is better for storing byte data than a string. That doesn't need to be explained further.

Other than the ones I mentioned, are there any operations we can perform on `ByteArray`s (especially things other than element extraction)? Here are a few more suggestions, in addition to the ones I already mentioned:

- Efficiently changing elements in place through `Part`: `a = ByteArray[...]; a[[2]] = 5;` (This should also support `Span`, i.e. `;;`.)
- `Append`, `Prepend`, `AppendTo`, `PrependTo`.
- Something like `Partition` to break a big array into parts. My envisioned use case is processing a large `ByteArray` without unpacking the whole thing to an integer list. Instead, we could unpack small sections at a time, process them, then re-pack them. So perhaps other methods, such as `Map`, `BlockMap`, etc., are more appropriate. (E.g., `Audio` has `AudioBlockMap`.) If `ByteCount` can be trusted, there is a storage overhead of 96 bytes per `ByteArray`, so perhaps complete pre-`Partition`-ing is not the best.
- Direct creation functions: analogues of `ConstantArray` (large constant byte array) and `RandomInteger` (for random bytes).

But the most important missing functionality is conversion:

- to/from strings (like `FromCharacterCode`, `ToCharacterCode`)
- to/from files (`BinaryReadList`, `BinaryWrite`)
- and very importantly: LibraryLink. Conversion to/from RawArray would suffice, as `RawArray`s work with LibraryLink since version 10.4. This would allow us to implement efficient functions for anything we need.

I looked into the implementation of some of the built-in functions, and I see that currently sending to/from LibraryLink is done through an inefficient conversion to a 64-bit integer list (i.e. the `{Integer, 1}` LibraryLink type).
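
The block-wise use case described above can be approximated with the functions already known to work on `ByteArray`. This is a rough sketch, not a built-in facility: `chunkSize` is a hypothetical tuning parameter, the input is random test data, and it assumes `Take` accepts a `{start, end}` range on a `ByteArray` in the version at hand.

```mathematica
(* Test input: one million random bytes *)
ba = ByteArray[RandomInteger[255, 10^6]];
chunkSize = 4096;   (* hypothetical block size *)

(* Sum all bytes while unpacking only one block at a time *)
checksum = Sum[
   Total[Normal[Take[ba, {i, Min[i + chunkSize - 1, Length[ba]]}]]],
   {i, 1, Length[ba], chunkSize}];

(* Sanity check against unpacking everything at once; should give True *)
checksum === Total[Normal[ba]]
```

Only one block of 64-bit integers exists at any time, so peak memory stays close to the size of the byte array itself plus one unpacked chunk.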

POSTED BY: Szabolcs Horvát

Dorian Birraux

Dorian Birraux, WOLFRAM

Posted 9 years ago

`ByteArray` represents bytes, internally using one byte per value, which makes it space efficient. Because the data are binary, the performance gain from using it in place of a string of bytes is significant. Indeed, not all byte sequences are a valid `String`, so when one stores bytes in a string, the data needs to be validated. That is not required with `ByteArray`.

There is an effort to implement most if not all of the features you've listed as top-level functions. In the meantime, here are some undocumented functions that you may find interesting, as a complement to those you already mentioned:

- Developer`EncodeBase64: `ByteArray` to `String` conversion; takes a byte array, returns a Base64 string.

      In[1]:= Developer`EncodeBase64[ByteArray[Range[5]]]
      Out[1]= "AQIDBAU="

- `BinaryWrite` accepts `ByteArray`, as an undocumented feature. I'm not aware of a reverse function that reads a file directly into a byte array.

Generally speaking, `ByteArray` is recommended everywhere one uses binary data.

This leads me to `BinarySerialize`. `BinarySerialize` serializes any Wolfram Language expression to a binary representation that is platform independent and fast to deserialize. Unlike MX, which contains all the definitions of a given expression, like `DownValues`, `BinarySerialize` is more data oriented and represents, roughly, only the `FullForm` of an expression. The format used by `BinarySerialize` has simple enough specifications that we may consider publishing them.

As users noticed on StackOverflow, `BinarySerialize` does not always produce a smaller output with respect to `ByteCount`. One reason for that is that, unlike `Compress`, `BinarySerialize` does not automatically perform compression of the output. You pay the cost of a zlib compression only if need be. Also, some expressions like arrays (packed, raw) are already efficiently stored in memory, so having an output size of roughly the size of the byte count is generally a good result (`Range` produces a packed array).
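
The pieces mentioned in this reply can be combined into a short round-trip check. Treat it as a sketch under assumptions: the two Developer` functions are the undocumented ones named in this thread and may change or disappear, and the file name is a made-up example.

```mathematica
ba = ByteArray[Range[5]];

(* Base64 round trip with the undocumented Developer` functions from this thread *)
b64 = Developer`EncodeBase64[ba]                  (* "AQIDBAU=" per the post above *)
Normal[Developer`DecodeBase64ToByteArray[b64]]    (* expected: {1, 2, 3, 4, 5} *)

(* BinarySerialize: compare the serialized length with ByteCount *)
data = Range[10^4];                               (* Range produces a packed array *)
ser = BinarySerialize[data];
{ByteCount[data], Length[ser]}
BinaryDeserialize[ser] === data

(* Write a ByteArray with BinaryWrite, then read the bytes back as integers *)
file = FileNameJoin[{$TemporaryDirectory, "bytearray-demo.bin"}];  (* hypothetical name *)
stream = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[stream, ba];
Close[stream];
BinaryReadList[file]                              (* {1, 2, 3, 4, 5} *)
```

Reading back still goes through an integer list here, which is exactly the gap noted above: at the time of this post there was no known function that reads a file directly into a `ByteArray`.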

POSTED BY: Dorian Birraux
