Data Science in the Wolfram Language - the HDF5 File Format

Posted 3 months ago

One of the major challenges that users face when trying to do data science in Mathematica with the Wolfram Language is how to handle big data. Leaving aside the important topic of database connectivity/functionality and the handling of data too large to fit in memory, my concern here is with the issue of how to handle large data files, which are often in csv format, but which are not too large to fit into available memory.

It is well known that, due to their generality, Mathematica's Import and Export functions are horribly slow when handling large csv files, for example:

A = RandomReal[{0, 1}, 10000000];
Export["test.csv", A] // AbsoluteTiming

{255.71, "test.csv"}

and

Acsv = Flatten@Import["test.csv"]; // AbsoluteTiming
{27.4954, Null}

Performance results like these leave the user with the impression that Mathematica is suitable for handling only "toy" problems, rather than the kind of large and complex data challenges faced by data scientists in the real world.

Sure, you can speed this up with ReadLine, but not by much, after doing all the string processing. And while the mx binary file format speeds up data handling enormously, it doesn't address the issue of how to get the data into the requisite file format, other than via the WL DumpSave function - in other words, the data already has to be in a Mathematica notebook in order to write an mx file.

A major step in the right direction has been achieved through the significant effort that WR has put into implementing the HDF5 binary file format standard in the Wolfram Language. This serves two purposes: firstly, it can speed up the storage and retrieval of large datasets by orders of magnitude (depending on the data type); secondly, unlike Wolfram's proprietary mx file format, HDF5 is an open format that can store large, complex datasets that are accessible via Python, R and MATLAB, as well as other languages/platforms, including Mathematica. So, working with the same dataset as before, but using HDF5 format, we get a speed-up of around 500x on the file write and around 250x on the file read:

Export["test.h5", A] // AbsoluteTiming
{0.490421, "test.h5"}

Ah5 = Import["test.h5", {"Data", 1}]; // AbsoluteTiming
{0.108682, Null}
Ah5 == A
True
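
As a quick sanity check on those ratios, the timings quoted above can be divided directly. This is only arithmetic on the numbers already shown (sketched in Python, since that is the other language this workflow involves), not a new benchmark:

```python
# Timings in seconds, copied from the AbsoluteTiming results above.
csv_write, csv_read = 255.71, 27.4954
h5_write, h5_read = 0.490421, 0.108682

# Speed-up factor = CSV time divided by HDF5 time.
write_speedup = csv_write / h5_write
read_speedup = csv_read / h5_read

print(f"write: {write_speedup:.0f}x, read: {read_speedup:.0f}x")
```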

So it becomes perfectly feasible to envisage a workflow in which some pre-processing of a very large dataset in csv format takes place initially in e.g. Python Pandas, the results of which are exported to an HDF5 format file for further processing in Mathematica.

It seems to me that this advance does a great deal to address some of the major concerns about using Mathematica for large data science projects. And I am not sure that users are necessarily aware of its significance, given all the hoopla over more glamorous features that tend to get all the attention in new version releases.

6 Replies

Other simple formats include binary:

Export["test.bin", A, "Real64"] // AbsoluteTiming
0.35 sec

Import["test.bin", "Real64"]; // AbsoluteTiming
0.2 sec

Which is generally also pretty fast!

Looks even faster! But are those file formats directly readable in Python/R, etc? And, more to the point, can those same file formats be generated from Python/R and read directly into Mathematica?

I suppose the other consideration is that HDF5 is capable of storing much more complex, hierarchical datasets compared to those I have used here.

Yes, they can be read by any other machine and language. The only thing you have to make sure of is that the machines have the same endianness. I frequently do it between C++ and Mathematica. Basically, the file is a direct copy of what the data looks like in RAM, so reading it is just copying, with no interpretation or processing steps. HDF5 does similar things, I believe, and can handle multiple datasets and so on in a much more convenient manner. Reading HDF5 in C++ is trickier, though. Binary formats are very easy to read and write without packages, add-ins or libraries…
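
To make the endianness point concrete, here is a minimal Python sketch of writing and reading such a raw Real64 file using only the standard library. The helper names `write_real64` and `read_real64` and the file name are hypothetical, chosen for illustration. The sketch assumes the file holds nothing but 8-byte IEEE 754 doubles, and it pins the byte order to little-endian explicitly rather than relying on the machine default.

```python
import os
import struct
import tempfile

def write_real64(path, values, byte_order="<"):
    """Write floats as raw IEEE 754 doubles ("<" little-endian, ">" big-endian)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{byte_order}{len(values)}d", *values))

def read_real64(path, byte_order="<"):
    """Read a raw file of doubles back into a list of Python floats."""
    with open(path, "rb") as f:
        raw = f.read()
    count = len(raw) // 8  # each Real64 occupies exactly 8 bytes
    return list(struct.unpack(f"{byte_order}{count}d", raw[: count * 8]))

values = [0.0, 0.25, 1.5, -3.75]
path = os.path.join(tempfile.gettempdir(), "test_real64.bin")

write_real64(path, values)
file_size = os.path.getsize(path)                # no header, just the numbers
round_trip = read_real64(path)                   # same byte order: exact copy
wrong_order = read_real64(path, byte_order=">")  # mismatched order: garbage
```

On a little-endian machine, a file written this way should be readable in Mathematica with the same Import[..., "Real64"] call shown above. For arrays with millions of values, Python's array module or NumPy would be a better fit than struct, which is used here only to keep the byte order visible.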

Import and Export of HDF5 are implemented using a paclet called HDF5Tools. Advanced users in performance-critical applications can use paclet functions directly to write code that avoids the overhead of Import and Export and is often significantly faster. Below is a quick demo.

Before we move on, a short disclaimer: internal functionality may not be well documented and is not guaranteed to work in future versions of the Wolfram Language.

First, we explicitly load and initialize the paclet (normally, Import does it under the hood):

In[2]:= Needs["HDF5Tools`"]

In[3]:= HDF5ToolsInit[True]

Out[3]= True

Then we create a new HDF5 file called "test.h5" with one group called "MyGroup":

In[4]:= fileId = h5fcreate["test.h5", H5FACCTRUNC];

In[5]:= myGroup =  h5gcreate[fileId, "MyGroup", H5PDEFAULT, H5PDEFAULT, H5PDEFAULT];

To write data to this file we need to create a dataset. We will call it "NewDataset". But first, a dataspace is needed:

In[6]:= dspace = h5screatesimplen[1, {10000000}];

In[7]:= dset = 
 h5dcreate[myGroup, "NewDataset", H5TNATIVEDOUBLE, dspace, H5PDEFAULT,
   H5PDEFAULT, H5PDEFAULT];

Now we are ready to write data to the file:

In[8]:= A = RandomReal[{0, 1}, 10000000];

In[9]:= h5dwrite[dset, H5TNATIVEDOUBLE, H5SALL, H5SALL, H5PDEFAULT,  A] // AbsoluteTiming

Out[9]= {0.0721832, 0}

Finally, we must manually close all created objects:

In[10]:= h5dclose@dset;
h5sclose@dspace;
h5gclose@myGroup;
h5fclose@fileId;

Reading the data could look like this. First we open the file:

In[14]:= file = h5fopen["test.h5", H5FACCRDWR];

Then we open the dataset:

In[15]:= dset = h5dopen[file, "/MyGroup/NewDataset", H5PDEFAULT];

Now we can read the data from dataset to the Wolfram Language:

In[16]:= {timing, data} = AbsoluteTiming[h5dread[dset, H5TNATIVEDOUBLE, H5SALL, H5SALL, H5PDEFAULT]]

Out[16]= {0.0269582, NumericArray[< 10000000 >, Real64]}

The data is returned in the efficient form of a NumericArray. Finally, we release the resources:

In[17]:= h5dclose@dset;
h5fclose@file;

As you can see, the code requires some knowledge of the HDF5 format and is much harder to write than a simple call to Import or Export, but it's also noticeably faster.

I tested it with numpy arrays and threw in some np.nan values. These are stored as ordinary floating-point values, so they don't slow down the processing, and WL translates them into Indeterminate, which seems reasonable.

So your solution works great for numpy data arrays, but I guess not for pandas Series or DataFrames.
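
A small illustration of why the NaNs cost nothing extra, assuming the dataset is stored as 8-byte IEEE 754 doubles (which is what H5TNATIVEDOUBLE suggests on common hardware): NaN is just another bit pattern of the same width, so it travels through a binary file like any other value. The sketch uses only Python's standard library and does not touch HDF5 itself.

```python
import math
import struct

# A NaN packs into the same 8 bytes as any other double.
nan_bytes = struct.pack("<d", float("nan"))
(restored,) = struct.unpack("<d", nan_bytes)

# Mixed data keeps a fixed width per element, so no special handling is needed.
mixed = [1.0, float("nan"), 2.5]
packed = struct.pack("<3d", *mixed)
unpacked = struct.unpack("<3d", packed)

# NaN never compares equal to itself, so locate it with math.isnan.
nan_positions = [i for i, x in enumerate(unpacked) if math.isnan(x)]
```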

Rafal,

Thanks for the further detailed explanation.

I've been experiencing some difficulty creating HDF5 files in WL and reading them into Python, and vice-versa.

If you happen to have a couple of canned examples of how the function parameters should be set to ensure compatibility, that would be very helpful.
