Need Help With Speed or Memory Problem

Posted 11 years ago

Hello everyone,

I need help with a speed or memory management problem. I have a small function entirely wrapped in Module. It does the following:

  • Imports the names of small text files within a folder (thousands of them)
  • Imports the files in batches of 100, extracts some information from them, and then exports the extracted information in a single CSV file

If there are 14,000 files in the source folder, the Do loop in my module will import 140 batches of 100 files each and then export 140 CSV files. There are no global variables set within the module, and there are no variables that are accumulating data; each import overwrites the data of the previous one.
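For readers following along, the workflow described above can be sketched roughly as follows. This is not the code from the original function, which was not posted; `extractInfo`, `exportInBatches`, and the `batch-NNNN.csv` naming are hypothetical stand-ins, and the extraction step is only a placeholder.

```mathematica
(* extractInfo is a hypothetical placeholder for the real per-file extraction *)
extractInfo[file_String] :=
 {FileNameTake[file], StringLength[Import[file, "Text"]]}

(* Import the .txt files in sourceDir in batches and export one CSV per batch *)
exportInBatches[sourceDir_String, outputDir_String, batchSize_Integer: 100] :=
 Module[{batches},
  (* UpTo lets the last batch be shorter than batchSize *)
  batches = Partition[FileNames["*.txt", sourceDir], UpTo[batchSize]];
  Do[
   Export[
    FileNameJoin[{outputDir, "batch-" <> IntegerString[i, 10, 4] <> ".csv"}],
    extractInfo /@ batches[[i]], "CSV"],
   {i, Length[batches]}];
  Length[batches]]
```

With 14,000 source files and the default batch size, a call such as `exportInBatches[sourceDir, outputDir]` would write 140 CSV files and return 140.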

The problem I notice while testing the script is that every time I run it within a given Mathematica session, the time taken to complete a batch of 100 files grows substantially. It takes 3 to 5 seconds per batch the first time I run it, and that grows to more than 50 seconds by the 7th or 8th run. When I quit Mathematica and restart it, the function is fast again the first time I run it. Also, within a given run, each new batch in the Do loop takes longer than the previous one.

Update (1). Using Block instead of Module speeds things up but not by much.

Update (2). After more experimentation, my best guess is that Mathematica seems to get bogged down when a (custom) function is called repeatedly. Could it be that Mathematica somehow tracks the variables called, and that this requires an ever-increasing amount of memory?
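A guess like this can be checked by measuring each call rather than inferring from wall-clock feel. The sketch below is only a diagnostic aid under stated assumptions: `measure` is a hypothetical helper name, and the three quantities it reports (elapsed seconds, growth in `MemoryInUse[]`, and change in the number of open streams from `Streams[]`) are simply plausible things to watch, not a confirmed diagnosis.

```mathematica
(* measure is a hypothetical helper: it evaluates expr once and reports
   elapsed time, kernel memory growth, and the change in open streams *)
SetAttributes[measure, HoldFirst];
measure[expr_] :=
 Module[{memBefore = MemoryInUse[], streamsBefore = Length[Streams[]], seconds},
  seconds = First[AbsoluteTiming[expr]];
  <|"Seconds" -> seconds,
   "MemoryGrowthBytes" -> MemoryInUse[] - memBefore,
   "NewOpenStreams" -> Length[Streams[]] - streamsBefore|>]

(* Example: a string stream that is opened but never closed should be
   reported as a new open stream *)
measure[ReadList[StringToStream["a: 1\nb: 2"], String]]
```

Wrapping each batch in `measure` and collecting the results would show whether time, memory, or the stream count climbs from one batch to the next.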

What is going on? Any advice would be most appreciated.

Gregory

POSTED BY: Gregory Lypny
6 Replies
Posted 11 years ago

Thanks once again, Hans,

This is opportune as I am in the process of breaking down my code into separate functions: one for company info and the other for beneficial ownership, for example. I will mix some of your code into mine and see what effect it has on speed.

Gregory

POSTED BY: Gregory Lypny

Gregory:

I was looking at your previous code where you process the company information segment. I believe that you may need to rewrite your code, particularly this segment:

   companyInfoArray = 
     Select[StringTrim /@ 
       StringSplit[
        Select[ReadList[StringToStream[companyInfoRaw], String], 
         StringMatchQ[#, ___ ~~ ":" ~~ ___] && ! 
            StringMatchQ[#, "<" ~~ ___] &], ":"], 
      Length[#] == 2 && #[[2]] =!= "" &];

The stream created by StringToStream[] is never closed with Close[]. Assign the stream to a variable (give it a handle) and then Close[] it when you are done, with something like this:

    companyInfoStream = StringToStream[companyInfoRaw];
    companyInfoArray = 
      Select[StringTrim /@ 
        StringSplit[
         Select[ReadList[companyInfoStream, String], 
          StringMatchQ[#, ___ ~~ ":" ~~ ___] && ! 
             StringMatchQ[#, "<" ~~ ___] &], ":"], 
       Length[#] == 2 && #[[2]] =!= "" &];
    Close[companyInfoStream];

At least it may help with memory management. I would start there. Hans
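A different way to sidestep the unclosed stream is to avoid creating one at all. The sketch below swaps `StringToStream`/`ReadList` for `StringSplit` on newlines, which is a substitution and not what either poster wrote; `parseCompanyInfo` is a hypothetical name, and the sketch assumes the raw text uses "\n" line endings. The filtering logic is otherwise the same as in the snippet above.

```mathematica
(* parseCompanyInfo is a hypothetical wrapper around the same filtering logic,
   with the string stream replaced by a plain split on newlines *)
parseCompanyInfo[companyInfoRaw_String] :=
 Module[{lines, pairs},
  lines = StringSplit[companyInfoRaw, "\n"];
  (* keep lines that contain a colon and do not start with "<" *)
  lines = Select[lines,
    StringMatchQ[#, ___ ~~ ":" ~~ ___] && ! StringMatchQ[#, "<" ~~ ___] &];
  (* split each line at colons and trim whitespace from the pieces *)
  pairs = StringTrim /@ StringSplit[lines, ":"];
  (* keep only {key, value} pairs with a non-empty value *)
  Select[pairs, Length[#] == 2 && #[[2]] =!= "" &]]

parseCompanyInfo["COMPANY NAME: Acme Corp\n<TAG>: skipped\nEMPTY:\nno colon here"]
```

Because no stream is opened, there is nothing to close, so repeated calls should not add entries to `Streams[]`.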

POSTED BY: Hans Michel

Greg:

If this is the same SEC Form DEF14A data processing, then I have a few questions. Are you persisting the *.txt files to your file system? If so, would you say URLFetch is working well? You then go back and process each file from the local file store using Import[]; is this correct? Do you save your CSV files in the same directory as your source files (I hope not)?

If these are the SEC index files, those file types are zipped (compressed), so any automatic decompression that Mathematica has to do is going to be memory intensive.

If this is the former situation and you are OK with URLFetch and persisting to the file system, then I would not go back to Import[] when there are other functions that give you more control, such as OpenRead[] and Close[] on the stream. To pull tables out of HTML document fragments, the automated choice is ImportString[HTMLFragment, "Data"]; note that the older DEF14A files (year < 2001) are more text than HTML. I would advise that you stop batching and just process the files one at a time, with a bit of error tracking: mark the files that fail processing, and return to that list afterward if there are any. 14,000 files at 1 second of processing per file is about 4 hours.
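The one-file-at-a-time approach with error tracking might look something like the sketch below. `processFile` and `processAll` are hypothetical names, and the "extraction" here just counts lines as a placeholder; the point is only the shape: open, read, close, and record which files failed.

```mathematica
(* processFile is a hypothetical stand-in for the real per-file extraction.
   It opens the file explicitly and always closes the stream it opened. *)
processFile[path_String] :=
 Module[{stream = OpenRead[path], lines},
  If[stream === $Failed,
   $Failed,
   lines = ReadList[stream, String];
   Close[stream];
   {FileNameTake[path], Length[lines]}]]

(* Process the files one by one; any file that fails or issues a message
   is recorded under "Failed" so it can be revisited later *)
processAll[paths_List] :=
 Module[{outcomes},
  outcomes = AssociationMap[Quiet[Check[processFile[#], $Failed]] &, paths];
  <|"Results" -> Values[Select[outcomes, # =!= $Failed &]],
   "Failed" -> Keys[Select[outcomes, # === $Failed &]]|>]
```

Calling `processAll[FileNames["*.txt", sourceDir]]`, with `sourceDir` standing for the download folder, returns the extracted rows together with the list of files to retry. At the assumed 1 second per file, 14,000 files take 14,000 seconds, or roughly 3.9 hours.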

I also assumed that you were going to wrap the entire process in a Mathematica package (before you do that, work out the kinks).

POSTED BY: Hans Michel
Posted 11 years ago

Hi Hans,

Yes, I am downloading en masse and then processing locally. I suspect that the slowdown has to do with repeated calls to string functions that depend on patterns. I will play with OpenRead[] and Close[] stream as you suggest. I will also try to construct a generic code example that illustrates the problem as requested by Stefan.

Thanks again,

Gregory

POSTED BY: Gregory Lypny

Can you share more of your code with us? Without those specifics, it is very hard to diagnose what could be going wrong.
