Importing big files (ImportSequential)

If you have ever tried importing a really big file, you probably had to reboot soon after, because the computer became unresponsive.

Recently I had to import a really big file (over 1 GB). For this, I wrote the code below.

It imports the file (assumed to be a text file) line by line. Pretty straightforward.

It would be nice if Mathematica could check whether a file is too big to import by conventional methods instead of freezing the computer. (And it's cumbersome to wrap every import in TimeConstrained or MemoryConstrained, since we silly humans forget things easily...)

ImportSequential[file_String] /; FileExistsQ@file := Module[{stream, lines = {}, l, i = 0},
    stream = OpenRead@file;
    PrintTemporary@Dynamic@i; (* live count of the lines read so far *)

    While[True,
       l = Read[stream, Record]; (* read the next line from the open stream *)
       If[l === EndOfFile,
         Break[]
       ];
       lines = {lines, l}; (* One of the fastest ways of adding a new element to a list that I know *)
       i++;
    ];
    Close@stream;

    Flatten@lines (* unnest the accumulated lines into a flat list *)
]
POSTED BY: Thales Fernandes
16 days ago

Indeed, it would be useful to add a parameter to the function that checks the file size before importing. If the file is too big, an error message asks you to reduce the file; otherwise the import proceeds.
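A minimal sketch of such a size check, assuming a hypothetical wrapper named SafeImport with a hypothetical "MaxBytes" option (neither is built in, and the 100 MB default is an arbitrary assumption):

```wolfram
(* Hypothetical helper: SafeImport and "MaxBytes" are illustrative names, not built-ins. *)
SafeImport::toobig = "File `1` is `2` bytes, which exceeds the limit of `3` bytes.";

Options[SafeImport] = {"MaxBytes" -> 10^8}; (* assumed default limit: 100 MB *)

SafeImport[file_String, opts : OptionsPattern[]] /; FileExistsQ[file] :=
 Module[{size, limit},
  size = FileByteCount[file];                          (* size on disk, in bytes *)
  limit = OptionValue[SafeImport, {opts}, "MaxBytes"]; (* explicit form of OptionValue *)
  If[size > limit,
   (* too big: issue a message and return $Failed instead of reading *)
   Message[SafeImport::toobig, file, size, limit]; $Failed,
   (* small enough: read all lines at once *)
   ReadList[file, Record]
  ]
 ]
```

For example, SafeImport[file, "MaxBytes" -> 10^9] would raise the limit to 1 GB for a single call. A sensible limit depends on the available memory, so the default here is only a placeholder.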

It is obviously useful to always remember that there are technical limits!

POSTED BY: Marie Lejendre
16 days ago

Yeup.

If you try the following piece of code, you will get a nice error:

The current computation was aborted because there was insufficient memory available to complete the computation.

Table[i, {i, 10^14}];

So this error handling is built-in for some functions but not for all.

POSTED BY: Thales Fernandes
16 days ago

Is there any evidence that this is any less memory hungry than the considerably faster ReadList[file, Record]?

According to a quick test with a 100 MB file, ReadList is both faster and more memory efficient.

SetAttributes[bench, HoldAll];
bench[expr_] := Module[{res, time, mem},
  mem = MaxMemoryUsed[{time, res} = AbsoluteTiming[expr]];
  Print["Memory: ", mem, " Timing: ", time];
  res
  ]

r1 = bench@ImportSequential["/var/log/install.log"];

Memory: 305939752 Timing: 2.90904

r2 = bench@ReadList["/var/log/install.log", Record];

Memory: 169102432 Timing: 0.705945
POSTED BY: Szabolcs Horvát
16 days ago

Edit: Forgot to reply...

My first try was using ReadList, but the files were so massive that I couldn't tell whether anything was happening or the computer had frozen.

My goal was to get some feedback on the import status. A new version of the function is shown below.

With it, you get a progress update (which can be very helpful!). My files can take over 1 min to import, so it is important to know how "long" the import will take.

Options@ImportSequential = {BatchSize -> 1024};
ImportSequential[file_String, OptionsPattern[]] /; FileExistsQ@file := Module[{stream, fsize = FileByteCount@file, l, lines = {}},
    stream = OpenRead@file;
    (* progress bar: in-memory size of the lines read so far, against the file size *)
    PrintTemporary@ProgressIndicator[Dynamic@ByteCount@lines, {0, fsize}];
    While[True,
       l = ReadList[stream, Record, OptionValue@BatchSize]; (* read up to BatchSize lines *)
       If[Length@l == 0, Break[]]; (* an empty batch means end of file *)
       lines = {lines, l};
    ];
    Close@stream;
    Flatten@lines
]

Importing a 1 GB file with your benchmark gives:

ImportSequential
Memory: 1350468128 Timing: 17.455
ReadList
$Aborted (after more than 5 min, with the kernel using over 4 GB of memory!)

I took the precaution of quitting the kernel before each evaluation. ReadList was so slow that I thought it would never finish (no patience), hence my first post.

With this improved version we import in batches, which is pretty fast, and we get feedback in the form of a progress bar (an added bonus).

POSTED BY: Thales Fernandes
16 days ago

For monitoring progress, it works great!

As for memory use, I think ReadList is the best choice right now. In principle, it should be possible to implement a more memory efficient reader in C provided that you know the datatypes in advance (e.g. all numeric, then use packed arrays).
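As a rough illustration of the packed-array idea without leaving the Wolfram Language (so this is not the C reader mentioned above), here is a sketch that assumes the file contains only whitespace-separated real numbers. ReadNumbersPacked and its batch argument are hypothetical names:

```wolfram
(* Hypothetical helper: ReadNumbersPacked is an illustrative name, not a built-in. *)
(* Assumption: the file holds only whitespace-separated real numbers. *)
ReadNumbersPacked[file_String, batch_Integer : 100000] /; FileExistsQ[file] :=
 Module[{stream = OpenRead[file], chunk, chunks},
  chunks = Reap[
     (* read up to `batch` numbers at a time until the stream is exhausted *)
     While[(chunk = ReadList[stream, Real, batch]) =!= {},
      Sow[Developer`ToPackedArray[chunk]]  (* store each batch in packed form *)
     ]
    ][[2]];
  Close[stream];
  (* join the packed batches; an empty file gives {} *)
  Join @@ First[chunks, {}]
 ]
```

Developer`PackedArrayQ can be applied to the result to check whether it is packed. Whether this actually uses less memory than a plain ReadList[file, Real] would have to be measured, for instance with the bench function above.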

POSTED BY: Szabolcs Horvát
15 days ago
