Hi Taliesin,
Thank you for the comments and tips! When I read your post, what immediately struck me was: why are you not using MathLink for this instead of JSON? MathLink must surely be faster than serializing to a textual representation. Or is it?
So I tried it out:
- I generate a list of integer arrays with random lengths between 0 and 10 (I keep them short to eliminate any packed-array advantage MathLink might or might not have). The integers can be up to 1000000000.
- Then I send this to Mathematica using either MathLink or JSON (using RapidJSON, which claims to be very fast).
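The test data generation on the C++ side corresponds to something like this (a sketch; the function name and the fixed seed are just for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Generate `n` integer lists with random lengths in [0, maxLen]
// and values in [0, 1000000000], matching the setup described above.
std::vector<std::vector<int>> generate(std::size_t n, int maxLen = 10) {
    std::mt19937 rng(42);  // fixed seed for reproducible benchmarks
    std::uniform_int_distribution<int> lenDist(0, maxLen);
    std::uniform_int_distribution<int> valDist(0, 1000000000);
    std::vector<std::vector<int>> list(n);
    for (auto &vec : list) {
        vec.resize(lenDist(rng));
        for (auto &x : vec) x = valDist(rng);
    }
    return list;
}
```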
And indeed, the JSON version is faster ...
This generates a list of $2^{21}$ tiny integer lists:
In[36]:= obj@"generate"[2^21]
Transfer using MathLink:
In[38]:= expr = obj@"getML"[]; // AbsoluteTiming
Out[38]= {1.94122, Null}
Transfer using JSON:
In[40]:= AbsoluteTiming[
expr2 = Developer`ReadRawJSONString[obj@"getJSON"[]];
obj@"releaseJSONBuffer"[];
]
Out[40]= {1.33406, Null}
In[41]:= expr == expr2
Out[41]= True
The JSON version is indeed faster.
But how is that possible? Doesn't MathLink use a binary representation for this, and shouldn't that take up less space and be faster?
Effectively, this is how I transferred the data using MathLink:
<!-- language: lang-c -->
std::vector<std::vector<int>> list;
...
MLPutFunction(link, "List", list.size());
for (const auto &vec: list)
MLPutInteger32List(link, vec.data(), vec.size());
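For comparison, the JSON path is essentially the following. The real code uses RapidJSON, but a plain standard-library sketch of the serialization (function name `toJSON` is illustrative) makes the comparison concrete:

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Serialize a list of integer lists to a JSON string, e.g. [[1,2],[],[3]].
// A minimal standard-library sketch; the actual benchmark used RapidJSON.
std::string toJSON(const std::vector<std::vector<int>> &list) {
    std::string out;
    out.reserve(list.size() * 16);  // rough guess to limit reallocations
    out += '[';
    char buf[16];  // enough for a 10-digit integer plus terminator
    for (std::size_t i = 0; i < list.size(); ++i) {
        if (i) out += ',';
        out += '[';
        for (std::size_t j = 0; j < list[i].size(); ++j) {
            if (j) out += ',';
            out.append(buf, std::snprintf(buf, sizeof buf, "%d", list[i][j]));
        }
        out += ']';
    }
    out += ']';
    return out;
}
```

The resulting string is then parsed in one go on the kernel side with ``Developer`ReadRawJSONString``.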
I did notice that the result does take up a lot more space in Mathematica than in JSON serialization:
In[31]:= Developer`WriteRawJSONString[expr] // ByteCount
Out[31]= 147593352
In[32]:= ByteCount[expr]
Out[32]= 352419368
That is understandable: in JSON, a 32-bit integer takes at most 10 digits, i.e. at most 10 bytes. In Mathematica, each (non-packed-array-member) integer is 8 bytes plus some meta information, totalling 16 bytes according to ByteCount.
But MathLink should be more efficient than that: given that I use MLPutInteger32List and am not putting the integers one by one, it should in principle be able to transfer them in some "packed" format; furthermore, it should only use 32 bits (not 64) for each, until they are read by the kernel.
Does this mean that MathLink is due for an update? Or does it have some inherent limitation which prevents it from being more efficient than it already is? Or are we perhaps seeing function call overhead compared to a header-only (thus fully inlineable) JSON library? It should definitely be possible to make a binary format faster than a text-based one like JSON (maybe Cap'n Proto, which you mentioned before, or something similar).
If I generate random-length lists in the length range 0..100 instead of 0..10, the performance advantage of JSON goes away.
In[43]:= obj@"generate"[2^18]
In[44]:= expr = obj@"getML"[]; // AbsoluteTiming
Out[44]= {1.47904, Null}
In[45]:= AbsoluteTiming[
expr2 = Developer`ReadRawJSONString[obj@"getJSON"[]];
obj@"releaseJSONBuffer"[];
]
Out[45]= {1.78779, Null}
Another question: if you use JSON transfer for a machine learning application, isn't it a problem that converting floating-point numbers from binary to decimal and back may not leave them intact? There may be a very small rounding error.