Shared-memory parallelism with large data

I am hoping to get advice from others who have tackled (or been tackled by) the problem of querying a large dataset in Mathematica.

My situation is that I have a large association (ByteCount of 100 GB, constructed from sequences in the human genome) that I am using as a hash table, because Lookup on an Association performs spectacularly well. Even so, I still expect querying the association to be a bottleneck, because of the number of queries needed to process multiple datasets of experimental data.
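To make the setup concrete, here is a minimal sketch of the kind of lookup I mean. The name `kmerTable`, its keys, and its values are made up for illustration; the real table is far larger.

```mathematica
(* Toy stand-in for the real table; keys and values are hypothetical *)
kmerTable = <|"ACGT" -> {101, 5503}, "CGTA" -> {102}, "GTAC" -> {103, 9921}|>;

(* Single-key lookup, with a default returned for absent keys *)
Lookup[kmerTable, "ACGT", Missing["NotFound"]]

(* Batched lookup: a list of keys is resolved in one call *)
Lookup[kmerTable, {"CGTA", "TTTT", "GTAC"}, Missing["NotFound"]]
```

Passing a list of keys in one call keeps the work inside a single Lookup rather than mapping over the keys one at a time.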

My experience with ParallelTable and friends in Mathematica is that you get the most benefit when very little data is transmitted back and forth between kernels. If this condition is not met, the parallel constructs often perform worse than their serial equivalents.
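A small sketch of the pattern I am describing, using a toy table in place of the real one (the names `table` and `queries` and the sizes are assumptions chosen only so that it runs quickly):

```mathematica
(* Toy table and queries; the sizes are arbitrary and much smaller than the real data *)
table = AssociationThread[Range[10^5] -> Range[10^5]^2];
queries = RandomInteger[{1, 10^5}, 10^4];

(* Every subkernel receives its own full copy of table *)
DistributeDefinitions[table];

(* Time the serial and the parallel version of the same lookups *)
serialTime = First@AbsoluteTiming[serial = Lookup[table, queries];];
parallelTime = First@AbsoluteTiming[
    parallel = ParallelMap[Lookup[table, #] &, queries];];

(* The last element checks that both versions return the same answers *)
{serialTime, parallelTime, serial === parallel}
```

With a table of 100 GB, the DistributeDefinitions step is exactly the part that does not scale, since the memory cost is multiplied by the number of kernels.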

Questions that come to mind:

A. Is there any known way to use parallel constructs to query an association that avoids the need for each parallel kernel to have its own copy of the association? (I am looking for concurrent hash table functionality in Mathematica.)

B. Since I suspect the answer to Question A is "no", does anyone know if the good folks at Wolfram are working to improve the capabilities or performance of the parallel functionality in Mathematica? I am not looking for deep insider secrets, but it does seem that attention to this functionality has waned: the documentation shows that many parallel commands were last revised in 2010.

C. Has anyone used an external concurrent hash table in their own projects that they linked back into Mathematica? Would you be willing to share your experience?

I appreciate your thoughts and advice.

Todd

POSTED BY: Todd Allen