The best way to work with a big data set?

Posted 9 years ago

I have a problem in which I need to work with a very large data set. The data is available as multiple CSV files, with a total size of as much as 100 GB. The structure is relatively simple and would map easily onto a key->value system. I want to be able to analyze the data, which would usually require extracting subsets (these may be 10 GB each) and then analyzing the results by the usual statistical methods.

If this were a smaller data set I would feel comfortable importing it into structured lists, as we did before V10, or importing it and assembling an associative dataset with the new tools. But here I am concerned about the size, which will certainly exceed what can be kept in RAM.

I have considered trying to map it into a V10 dataset. I have also wondered whether it would be possible to import it and export it as an SQL database, and then work with that.

I would be grateful for any advice.

Kind regards, David

POSTED BY: David Keith
My guess is that at 100 GB you may be forced to keep the data in a database for the initial filtering, and then perhaps convert the subsets to key-value form, though 10 GB is still a huge dataset. Please post your end solution. Thanks.
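
To make the filter-then-convert idea concrete, here is a minimal sketch in Python with SQLite (not Mathematica, and not a tested solution for data of this size). The table name `records`, the columns `key` and `value`, and the threshold filter are all hypothetical stand-ins, since the real schema is not given. The filtering happens inside the database, and only the rows that survive are grouped into a key->value mapping in memory.

```python
import sqlite3
from collections import defaultdict


def subset_as_mapping(conn, min_value):
    """Filter rows in SQL, then group the surviving rows by key in memory."""
    grouped = defaultdict(list)
    # Hypothetical schema: a table `records` with columns `key` and `value`.
    cursor = conn.execute(
        "SELECT key, value FROM records WHERE value >= ?", (min_value,)
    )
    for key, value in cursor:  # iterate the cursor instead of calling fetchall()
        grouped[key].append(value)
    return dict(grouped)


# Tiny in-memory stand-in for the real on-disk database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (key TEXT, value REAL)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("a", 1.0), ("a", 5.0), ("b", 7.0), ("c", 0.5)],
)
subset = subset_as_mapping(conn, 2.0)
```

If a 10 GB subset is still too large for the resulting mapping, the same query could be narrowed further, or the statistics could be computed in SQL with aggregate functions.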

POSTED BY: Mark Decker
Posted 9 years ago

Thanks, Mark. My first need is to get it into a data structure. It starts as hundreds of CSV files.
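
For reference, the loading step might look roughly like the sketch below. It is written in Python with SQLite rather than Mathematica, and it rests on assumptions: every file has the same header row, all columns are stored as text, and the function name `load_csv_files` and the table name `records` are made up for illustration. The files are read row by row and inserted in batches, so the whole data set never has to fit in RAM.

```python
import csv
import glob
import sqlite3
from itertools import islice


def load_csv_files(pattern, db_path, table="records", batch_size=50_000):
    """Stream every CSV file matching `pattern` into one SQLite table.

    Assumes all files share the header of the first file (hypothetical layout).
    Returns the number of data rows inserted.
    """
    conn = sqlite3.connect(db_path)
    total = 0
    columns = None
    insert_sql = None
    try:
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if columns is None:
                    # First file: create the table from its header row.
                    columns = header
                    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
                    conn.execute(
                        f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})'
                    )
                    placeholders = ", ".join("?" for _ in columns)
                    insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
                elif header != columns:
                    raise ValueError(f"{path}: header differs from the first file")
                # Insert in fixed-size batches to keep memory use bounded.
                while True:
                    batch = list(islice(reader, batch_size))
                    if not batch:
                        break
                    conn.executemany(insert_sql, batch)
                    total += len(batch)
            conn.commit()  # one commit per file
    finally:
        conn.close()
    return total
```

A call such as `load_csv_files("data/*.csv", "data.sqlite")` would build the database once (both paths are placeholders). Adding an index on the key column afterwards should speed up later subset queries.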

POSTED BY: David Keith
