Data Science in the Wolfram Language - the HDF5 File Format

Posted 3 months ago
6 Replies

Rafal,

Thanks for the further detailed explanation.

I've been experiencing some difficulty creating HDF5 files in WL and reading them into Python, and vice versa.

If you happen to have a couple of canned examples of how the function parameters should be set to ensure compatibility, that would be very helpful.

I tested it with numpy arrays and threw in some np.nan values. These are stored as ordinary floating-point values, so they don't slow down the processing, and WL translates them into Indeterminate, which seems reasonable.
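A small sketch of why NaN costs nothing extra: it is just one of the bit patterns of an IEEE 754 double, so it takes the same 8 bytes as any other value. This uses only the Python standard library rather than numpy or h5py, and the sample values are made up for illustration.

```python
import math
import struct

# Hypothetical sample values; math.nan stands in for np.nan here.
values = [1.5, math.nan, -2.0]

# Pack as little-endian 64-bit doubles, the layout of a float64 array.
raw = struct.pack("<3d", *values)

# Every value, NaN included, occupies exactly 8 bytes.
assert len(raw) == 8 * len(values)

# Unpacking returns the values; NaN survives the round trip.
restored = struct.unpack("<3d", raw)
nan_positions = [i for i, v in enumerate(restored) if math.isnan(v)]
print(nan_positions)
```

Because NaN is an ordinary double, a reader only has to decide how to present it, which is presumably where the translation to Indeterminate happens on the WL side.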

So your solution works great for numpy data arrays, but I guess not for pandas Series or DataFrames.

Import and Export of HDF5 are implemented using a paclet called HDF5Tools. Advanced users in performance-critical applications can use paclet functions directly to write code that avoids the overhead of Import and Export and is often significantly faster. Below is a quick demo.

Before we move on, a short disclaimer: internal functionality may not be well documented and is not guaranteed to work in future versions of the Wolfram Language.

First, we explicitly load and initialize the paclet (normally, Import does it under the hood):

In[2]:= Needs["HDF5Tools`"]

In[3]:= HDF5ToolsInit[True]

Out[3]= True

Then we create a new HDF5 file called "test.h5" with one group called "MyGroup":

In[4]:= fileId = h5fcreate["test.h5", H5FACCTRUNC];

In[5]:= myGroup =  h5gcreate[fileId, "MyGroup", H5PDEFAULT, H5PDEFAULT, H5PDEFAULT];

To write data to this file we need to create a dataset. We will call it "NewDataset". But first, a dataspace is needed:

In[6]:= dspace = h5screatesimplen[1, {10000000}];

In[7]:= dset = 
 h5dcreate[myGroup, "NewDataset", H5TNATIVEDOUBLE, dspace, H5PDEFAULT,
   H5PDEFAULT, H5PDEFAULT];

Now we are ready to write data to the file:

In[8]:= A = RandomReal[{0, 1}, 10000000];

In[9]:= h5dwrite[dset, H5TNATIVEDOUBLE, H5SALL, H5SALL, H5PDEFAULT,  A] // AbsoluteTiming

Out[9]= {0.0721832, 0}

Finally, we must manually close all created objects:

In[10]:= h5dclose@dset;
h5sclose@dspace;
h5gclose@myGroup;
h5fclose@fileId;

Reading the data could look like this. First, we open the file:

In[14]:= file = h5fopen["test.h5", H5FACCRDWR];

Then we open the dataset:

In[15]:= dset = h5dopen[file, "/MyGroup/NewDataset", H5PDEFAULT];

Now we can read the data from the dataset into the Wolfram Language:

In[16]:= data = h5dread[dset, H5TNATIVEDOUBLE, H5SALL, H5SALL, H5PDEFAULT] //  AbsoluteTiming

Out[16]= {0.0269582, NumericArray[< 10000000 >, Real64]}

The data is returned in the efficient form of a NumericArray. Finally, release resources:

In[17]:= h5dclose@dset;
h5fclose@file;

As you can see, the code requires certain knowledge of the HDF5 format and is much harder to write than a simple call to Import or Export, but it's also noticeably faster.

Yes, it can be read by any other machine and language. The only thing you have to make sure of is that the machines have the same endianness. I frequently do it between C++ and Mathematica. Basically, the file is a direct copy of what the data looks like in RAM, so reading it is just copying, with no interpretation or processing steps. HDF5 does similar things, I believe, and can handle multiple datasets and so on in a much more convenient manner. Reading HDF5 in C++ is trickier, though. Binary formats are very easy to read and write without packages/addins/libraries…
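As a rough illustration of the endianness point, here is a sketch of reading a raw file of 64-bit reals from Python using only the standard library. The helper name `read_real64` and the file name `demo.bin` are placeholders, and the script writes its own sample file so that it is self-contained. It assumes that a file produced by Export with "Real64" has the same headerless layout, which is worth checking on your own data.

```python
import array
import os
import sys
import tempfile

def read_real64(path, byteorder="little"):
    """Read a headerless file of 64-bit floats stored in the given byte order."""
    data = array.array("d")
    count = os.path.getsize(path) // data.itemsize
    with open(path, "rb") as f:
        data.fromfile(f, count)
    # array always uses the machine's native order; swap if the file differs.
    if byteorder != sys.byteorder:
        data.byteswap()
    return data.tolist()

# Write a few sample values in this machine's native byte order.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
sample = array.array("d", [0.25, 0.5, 0.75])
with open(path, "wb") as f:
    sample.tofile(f)

values = read_real64(path, byteorder=sys.byteorder)
print(values)
```

If the writing machine used the opposite byte order, passing that order as `byteorder` makes the reader swap the bytes, so matching endianness is only required when the reader does no such conversion.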

Looks even faster! But are those file formats directly readable in Python, R, etc.? And, more to the point, can those same file formats be generated from Python or R and read directly into Mathematica?

I suppose the other consideration is that HDF5 is capable of storing much more complex, hierarchical datasets compared to those I have used here.

Other simple formats include binary:

Export["test.bin", A, "Real64"] // AbsoluteTiming
0.35 sec

Import["test.bin", "Real64"]; // AbsoluteTiming
0.2 sec

Which is generally also pretty fast!
