Message Boards

Import large data files for Machine Learning?

Posted 5 years ago

I want to run a machine learning task on my Windows 10 PC (16 GB RAM, Mathematica 11.3.0), but I am facing the following problem: the training set is a 10 GB CSV file with 700,000,000 x 2 entries, and Mathematica simply stops during import via Import or ReadList. My idea is to split the input file into several smaller files and load them in batches to feed the Predict function or perhaps a neural network. Any idea how to make that happen? Or do you have a better idea?
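To illustrate what I have in mind, here is a rough sketch (the file name "bigdata.csv", the block size, and the two-column numeric layout are just my assumptions) that reads the file in fixed-size blocks from an open stream instead of importing everything at once:

    stream = OpenRead["bigdata.csv"];
    While[(lines = ReadList[stream, String, 1000000]) =!= {},
      pairs = (ToExpression /@ StringSplit[#, ","]) & /@ lines;
      (* ... each block of {input, target} pairs would be handed to Predict / NetTrain here ... *)
    ];
    Close[stream];

The open question is how to hand these blocks to Predict or to a network one batch at a time.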

Many thanks in advance for your support!

POSTED BY: Jürgen Kanz
11 Replies

Thank you again to all participants for their contributions. After some trials I have found an efficient way to import the big CSV file into MongoDB via mongoimport.
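For reference, the call looked roughly like this (database, collection, and field names are placeholders, and the flags may need adjusting for your mongoimport version):

    mongoimport --db mldata --collection training --type csv --fields x,y --file bigdata.csv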

The reference page already mentioned above is excellent for getting all the information needed to connect Mathematica to MongoDB and to make use of data that does not fit into memory: https://reference.wolfram.com/language/tutorial/NeuralNetworksLargeDatasets.html
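Following that tutorial, the connection from Mathematica is then only a few lines; a sketch (database and collection names below are placeholders):

    Needs["MongoLink`"]
    client = MongoConnect[];                    (* local MongoDB on the default port *)
    db = MongoGetDatabase[client, "mldata"];    (* placeholder database name *)
    coll = MongoGetCollection[db, "training"];  (* placeholder collection name *)
    MongoCollectionCount[coll]                  (* sanity check: number of imported records *)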

POSTED BY: Jürgen Kanz

Another possibility is to use a Mongo database, as described in the same link I gave above.

POSTED BY: Wolfgang Hitzl

I am sorry for the late response. Well, I have to admit that I am not a MongoDB expert. It was not possible for me to import the entire CSV file into Mongo; the import stops after about 1% (5.5 * 10^6) of the records. Now I am trying to convert the CSV to JSON, and I hope the Mongo import will succeed with JSON.
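Roughly what I am trying (a sketch, untested on the full file; file and field names are placeholders): convert the CSV to newline-delimited JSON in blocks, so nothing has to be held in memory at once, and then run mongoimport with --type json.

    in = OpenRead["bigdata.csv"];
    out = OpenWrite["bigdata.json"];
    While[(lines = ReadList[in, String, 100000]) =!= {},
      pairs = (ToExpression /@ StringSplit[#, ","]) & /@ lines;
      json = ExportString[<|"x" -> #[[1]], "y" -> #[[2]]|>, "JSON", "Compact" -> True] & /@ pairs;
      WriteString[out, StringRiffle[json, "\n"] <> "\n"]
    ];
    Close[in]; Close[out];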

Thanks again for your support.

POSTED BY: Jürgen Kanz

I suggest using a generator function for training on large data sets, as described here:

Training on large data sets
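In outline it works like this (a minimal sketch, not the tutorial's exact code; the net is a toy and fetchBatch is a stand-in you would replace with a real read from disk or a database):

    (* toy net: one numeric input, one numeric output *)
    net = NetChain[{LinearLayer[32], Ramp, LinearLayer[1]}, "Input" -> 1];

    (* stand-in for a real batch read from a file or database *)
    fetchBatch[n_] := Table[{RandomReal[]} -> {RandomReal[]}, n];

    (* NetTrain calls the generator once per batch, passing the requested "BatchSize" *)
    generator = Function[spec, fetchBatch[spec["BatchSize"]]];

    trained = NetTrain[net, {generator, "RoundLength" -> 100000},
       BatchSize -> 1024, MaxTrainingRounds -> 5]

This way only one batch at a time ever has to sit in memory.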

POSTED BY: Wolfgang Hitzl

Thank you for the hint.

POSTED BY: Jürgen Kanz
Posted 5 years ago

Is it possible for Mathematica to build a ML model based on a stream of data?

See my answer to this question.

POSTED BY: Rohit Namjoshi

That is very useful! Thank you very much. It seems that I can solve the problem with a database-backed stream of data and a neural network. I will give it a try.

POSTED BY: Jürgen Kanz

I don't know, but I know Mathematica can take advantage of Hadoop and MapReduce.

I uploaded my data to MySQL, then just created views in MySQL and read them from Mathematica.
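One way this can look on the Mathematica side, e.g. with the DatabaseLink` package (a sketch; connection details and the view name are placeholders):

    Needs["DatabaseLink`"]
    conn = OpenSQLConnection[
       JDBC["MySQL(Connector/J)", "localhost:3306/mldata"],
       "Username" -> "user", "Password" -> "secret"];

    (* pull one manageable slice of the view per query *)
    batch = SQLExecute[conn, "SELECT x, y FROM training_view LIMIT 100000 OFFSET 0"];

    CloseSQLConnection[conn];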

Good hint, thank you. I am currently trying to import the data into PostgreSQL. This approach means that the Predict function or a neural network would get the data as a stream, with the consequence that not all of the data would be available at any one moment (due to limited RAM).

Is it possible for Mathematica to build a ML model based on a stream of data?

POSTED BY: Jürgen Kanz