

Importing big files (ImportSequential)

Posted 1 year ago
5 Replies
3 Total Likes

If you have ever tried importing a really big file, you probably had to reboot your computer soon after, because it became unresponsive.

Recently I had to import a really big file (over 1 GB). For this, I created the code below.

It imports your file (assumed to be a text file) line by line. Pretty straightforward.

It would be nice if Mathematica could check whether a file is too big to be imported by conventional methods, instead of freezing the computer. (And it's cumbersome to wrap every call in TimeConstrained or MemoryConstrained, since we silly humans forget things easily...)
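As a stopgap, the safeguard can be built into the call itself so it isn't forgotten. A minimal sketch (the limits and the `safeImport` name are arbitrary examples of mine, not recommendations):

    (* Sketch: abort the import if it allocates more than ~2 GB
       or runs longer than 60 seconds. Tune the limits to your machine. *)
    safeImport[file_String] :=
     TimeConstrained[
      MemoryConstrained[Import[file], 2*^9, $Failed (* memory limit hit *)],
      60, $Failed (* time limit hit *)]

Either constraint failing returns `$Failed` instead of taking the front end down with it.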

ImportSequential[file_String] /; FileExistsQ@file := Module[{stream, lines = {}, l},
    stream = OpenRead@file;
    While[(l = Read[stream, Record]) =!= EndOfFile,
       lines = {lines, l} (* one of the fastest ways of adding a new element to a list that I know *)
    ];
    Close@stream;
    Flatten@lines
]
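The `lines = {lines, l}` idiom is worth a note: instead of copying the whole list on every append (as `AppendTo` does), it nests the accumulator inside a new pair, and a final `Flatten` recovers the flat list. A small illustration:

    acc = {};
    Do[acc = {acc, i}, {i, 5}];
    acc            (* {{{{{{}, 1}, 2}, 3}, 4}, 5} *)
    Flatten[acc]   (* {1, 2, 3, 4, 5} *)

Each append is constant-time, so building n lines costs O(n) rather than the O(n^2) of repeated copying.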

5 Replies

Indeed, it would be interesting to add a parameter to the script that checks the file size before importing. If the file is too big, an error message would appear asking you to reduce the file; otherwise the import proceeds.

It is obviously useful to always remember that there are technical limits!


If you try the following piece of code, you will get a nice error:

Table[i, {i, 10^14}];

The current computation was aborted because there was insufficient memory available to complete the computation.

So this error handling is built-in for some functions but not for all.

Is there any evidence that this is any less memory hungry than the considerably faster ReadList[file, Record]?

According to a quick test with a 100 MB file, ReadList is both faster and more memory efficient.

SetAttributes[bench, HoldAll];
bench[expr_] := Module[{res, time, mem},
  mem = MaxMemoryUsed[{time, res} = AbsoluteTiming[expr]];
  Print["Memory: ", mem, " Timing: ", time];
  res
]

r1 = bench@ImportSequential["/var/log/install.log"];

Memory: 305939752 Timing: 2.90904

r2 = bench@ReadList["/var/log/install.log", Record];

Memory: 169102432 Timing: 0.705945

Edit: Forgot to reply...

My first try was using ReadList, but the files were so massive that I couldn't tell whether anything was being done or the computer had frozen.

My goal was to get some feedback on the import status. A new version of this function is shown below.

With this it is possible to get a status update (which can be very helpful!). My files can take over 1 min to import, so it is of utmost importance to know how long it will take.

Options@ImportSequential = {BatchSize -> 1024};
ImportSequential[file_String, OptionsPattern[]] /; FileExistsQ@file := Module[
    {stream, fsize = FileByteCount@file, l, lines = {}},
    stream = OpenRead@file;
    PrintTemporary@ProgressIndicator[Dynamic@ByteCount@lines, {0, fsize}];
    While[True,
       l = ReadList[stream, Record, OptionValue@BatchSize];
       If[Length@l == 0, Break[]];
       lines = {lines, l}
    ];
    Close@stream;
    Flatten@lines
]

Importing a 1 GB file with your benchmark, I get:

Memory: 1350468128 Timing: 17.455
$Aborted (more than 5 min, and the kernel was using >4 GB of memory!)

Where I took the precaution of quitting the kernel before each evaluation. ReadList was so slow I thought it was never going to finish (no patience). Hence my first post.

With this new improved version we can import in batches, which is pretty fast, and we get feedback via a status bar (an added bonus).

For monitoring progress, it works great!

As for memory use, I think ReadList is the best choice right now. In principle, it should be possible to implement a more memory efficient reader in C provided that you know the datatypes in advance (e.g. all numeric, then use packed arrays).
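Short of writing a C reader, some of that saving is already reachable from the top level: if the file is purely numeric, reading with the `Number` type and packing the result stores the data as a packed array of machine numbers instead of a list of boxed expressions. A hedged sketch (the filename is a placeholder; packing succeeds only when all values share one numeric type):

    (* Assumes "data.txt" contains whitespace-separated numbers of one type. *)
    nums = Developer`ToPackedArray@ReadList["data.txt", Number];
    Developer`PackedArrayQ[nums]  (* True when the input packed successfully *)

A packed array of reals needs roughly 8 bytes per element, typically a large factor less than the unpacked list representation.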
