Message Boards

Is there a best way to append data to a file?


Hi everyone,

I've written a function that returns a list, and I have to loop the function over a big sample. If I loop through all of my sample data, the result would be an array of about 10 million rows and about 30 columns. I know from past experience that a procedure such as this will slow as the array grows, so instead, I will run the procedure on subsamples of the data, creating many smaller output arrays, saving each to a file in turn, and clearing them as I go. What I am grappling with now is how to append those smaller arrays to a single file and, ideally, merge them into one big array. Here's what I came up with using three arrays as pretend output. My approach is clunky, and my question is whether there is a better way.

Here are the output arrays, all of which are 4 x 3. m1 contains real numbers, m2 strings, and m3 integers just to be able to distinguish them easily.

SeedRandom[666]
m1 = RandomReal[1, {4, 3}]
m2 = RandomChoice[CharacterRange["A", "Z"], {4, 3}]
m3 = RandomInteger[100, {4, 3}]
m = {m1, m2, m3};

Create a file to save them. dumpPath is just the file path to my desktop.

tmpFile = FileNameJoin[{dumpPath, "tmp.nb"}]
CreateFile[tmpFile];

Append the arrays to the file using PutAppend.

Map[PutAppend[#, tmpFile] &, m]

Of course, by using PutAppend, the arrays are appended as separate expressions to the same cell in the notebook file, but are not merged or joined into one array. That would be nice, but I don't know how to do that, so after the procedure is done, I have to read the file back into Mathematica to join the arrays. I do this using ReadList.

readData = ReadList[tmpFile]

This gives me a list containing the three arrays. (It will be a huge list when I run it on my real sample.)

Finally, the three arrays can be merged into one using Flatten and then saved to another notebook file (step not shown).

Flatten[readData, 1]

I'm interested in other approaches, especially one that would append the output to the file and join it (build a single array) as it goes. Any tips would be much appreciated.

Regards,

Greg

POSTED BY: Gregory Lypny
Answer
23 days ago

Hi Gregory,

You need a minor change in the code to achieve the desired result:

SeedRandom[666]
m1 = RandomReal[1, {4, 3}]
m2 = RandomChoice[CharacterRange["A", "Z"], {4, 3}]
m3 = RandomInteger[100, {4, 3}]
m = Sequence[m1, m2, m3];

tmpFile = FileNameJoin[{dumpPath, "tmp.m"}]

PutAppend[m, tmpFile] 

readData = ReadList[tmpFile]

Now readData already contains the flattened array. Note that the correct extensions for files containing Mathematica expressions intended for loading with Get, ReadList, or Import are .m and .wl; the extension .nb is for Mathematica notebooks.

POSTED BY: Alexey Popkov
Answer
23 days ago

Hi Alexey,

Thanks for your suggestion, but I must be doing something wrong. I replaced {…} with Sequence and changed the extension of the export file to .m, as in your code, but ReadList[tmpFile] still returns a list of lists. It seems that Sequence has eliminated the need for Map.

In any case, I should have been clearer in my example. I won't have all of the matrices—m1, m2, and m3—available all at once for appending to a file. They will be appended and then deleted or cleared in turn as soon as they are produced, in order to conserve memory. So, what I'd like to accomplish is to flatten the contents of the file every time one of the matrices is appended: flatten as I go.
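One way to get this flatten-as-you-go behavior with PutAppend alone is to append individual rows rather than whole matrices, so that ReadList later returns an already-flat list of rows. A minimal sketch, with an illustrative file path:

```mathematica
(* Sketch: append individual rows rather than whole matrices, so the
   file accumulates one row expression per line and never needs a
   final Flatten. File path is illustrative. *)
tmpFile = FileNameJoin[{$TemporaryDirectory, "tmp.m"}];
If[FileExistsQ[tmpFile], DeleteFile[tmpFile]];  (* start clean *)

SeedRandom[666];
m1 = RandomReal[1, {4, 3}];
Scan[PutAppend[#, tmpFile] &, m1];  (* append m1's rows, then clear it *)
Clear[m1];

m2 = RandomChoice[CharacterRange["A", "Z"], {4, 3}];
Scan[PutAppend[#, tmpFile] &, m2];
Clear[m2];

readData = ReadList[tmpFile];  (* already flat: 8 rows of 3 *)
```

Because each chunk is cleared right after its rows are appended, only one chunk is ever in memory at a time, which matches the conserve-memory constraint described above.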

Greg

POSTED BY: Gregory Lypny
Answer
22 days ago

Greg,

You should be able to handle 10 million x 30 arrays in Mathematica without breaking them up. I tried:

In[8]:= Timing[bigarray = RandomReal[1, {10000000, 30}];]

Out[8]= {2.75571, Null}

It only took about 2.8 seconds to generate a random array of that size, and Mathematica was fine with it. There are several things you can do to optimize this:

1. Do not print the arrays; end each assignment with a semicolon so they are not displayed.
2. Stay away from Do and other looping constructs; use the list functions such as Table, Map, etc.
3. Write the data (whether or not you break it up) into a binary file, such as .mat, or use the binary read and write functions.
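As a rough sketch of tip 3, assuming the chunks are all machine-precision reals (a binary numeric format won't hold the string matrix m2 from the original example), the binary read and write functions might be used like this; the file path and chunk sizes are illustrative:

```mathematica
(* Sketch: stream numeric chunks into one binary file as 64-bit
   reals, then read the whole thing back. Only suits numeric data. *)
file = FileNameJoin[{$TemporaryDirectory, "chunks.bin"}];
stream = OpenWrite[file, BinaryFormat -> True];
Do[
  chunk = RandomReal[1, {4, 3}];   (* stand-in for one computed chunk *)
  BinaryWrite[stream, Flatten[chunk], "Real64"],
  {3}];
Close[stream];

(* Restore the 3-column shape on the way back in *)
data = Partition[BinaryReadList[file, "Real64"], 3];
Dimensions[data]  (* {12, 3} *)
```

Binary I/O avoids the expression parsing that Get and ReadList must do on a .m file, which matters at the 10-million-row scale discussed here.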

Note if you want to run your functions over subsets of the bigarray, you can do that by using the Part functionality and still keep the array as one big array for writing to your file later. For example

bigarray[[1 ;; 5]] = RandomReal[10, {5, 30}]

will replace the first five rows (a 5 × 30 block) with new numbers ranging from 0 to 10. This is done in place, so you still have one big array but can process it in "chunks".

I hope this helps.

Regards,

Neil

POSTED BY: Neil Singer
Answer
22 days ago

Hi Neil,

Thanks for the tips. I do, in fact, avoid Do and instead use Map or Table. The big array I refer to is the result of many computations from a correspondingly big dataset. Each row of the array will be the output of one computation; in that way the array is being built up, and my experience has been that repeated evaluation of a function can cause a process to slow to a crawl. I do things like $HistoryLength = 2 to mitigate that.

Greg

POSTED BY: Gregory Lypny
Answer
15 days ago

Write[channel, expr1, expr2, …] writes the expressions expri in sequence, followed by a newline, to the specified output channel.

The output channel used by Write can be a single file or pipe, or a list of them, each specified as a string "name", as File["name"], or as an OutputStream object.


I may not understand the initial question, but Write appends on each invocation, is efficient, and can write expressions to a file or stream.

If that's so, PutAppend is not the only option.
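A minimal sketch of the Write-based approach, with an illustrative file path: keep one OutputStream open, write each chunk's rows as they are produced, and ReadList then returns a single flat list of rows with no Flatten step.

```mathematica
(* Sketch: keep one OutputStream open, Write each chunk's rows as they
   are produced, and close the stream at the end. ReadList then returns
   a single flat list of rows. File path is illustrative. *)
file = FileNameJoin[{$TemporaryDirectory, "rows.m"}];
stream = OpenWrite[file];
Do[
  chunk = RandomInteger[100, {4, 3}];  (* stand-in for one result chunk *)
  Scan[Write[stream, #] &, chunk],     (* one row expression per line *)
  {3}];
Close[stream];

rows = ReadList[file];
Dimensions[rows]  (* {12, 3} *)
```

Writing to an open stream also avoids the open-append-close cycle that PutAppend performs on every call, which should be noticeably cheaper over many small appends.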

POSTED BY: John Hendrickson
Answer
18 days ago

The Write command works well, a good alternative to PutAppend.

Thanks, John.

Greg

POSTED BY: Gregory Lypny
Answer
15 days ago
