
Need Help With Speed or Memory Problem

Posted 9 years ago

Hello everyone,

I need help with a speed or memory management problem. I have a small function wrapped entirely in a Module. It does the following:

  • Imports the names of the small text files within a folder (thousands of them)
  • Imports the files in batches of 100, extracts some information from them, and then exports the extracted information from each batch to a CSV file

If there are 14,000 files in the source folder, the Do loop in my module will import 140 batches of 100 files each and then export 140 CSV files. There are no global variables set within the module, and there are no variables that are accumulating data; each import overwrites the data of the previous one.
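A skeleton of the loop, with placeholder names (sourceFolder, outFolder, extractInfo) standing in for my actual code:

    fileNames = FileNames["*.txt", sourceFolder];  (* sourceFolder is a placeholder *)
    batches = Partition[fileNames, 100];           (* 14,000 names -> 140 batches *)
    Do[
     batchData = Import[#, "Text"] & /@ batches[[i]];  (* read one batch of files *)
     extracted = extractInfo /@ batchData;             (* extractInfo is a placeholder *)
     Export[FileNameJoin[{outFolder, "batch" <> ToString[i] <> ".csv"}], extracted],
     {i, Length[batches]}]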

The problem I notice while testing is that each time I run the function within a given Mathematica session, the time taken to complete a batch of 100 files grows substantially: 3 to 5 seconds per batch on the first run, rising to more than 50 seconds by the seventh or eighth run. When I quit and restart Mathematica, the function is fast again on the first run. Also, within a single run, each new batch in the Do loop takes longer than the previous one.

Update (1). Using Block instead of Module speeds things up but not by much.

Update (2). After more experimentation, my best guess is that Mathematica seems to get bogged down when a (custom) function is called repeatedly. Could it be that Mathematica somehow tracks the variables called, and that this requires an ever-increasing amount of memory?
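One thing I will try, in case the session is holding onto old results through the In/Out history (just a guess on my part; myBatchFunction is a placeholder for the Module described above):

    $HistoryLength = 0;  (* keep no In/Out history, so old results can be freed *)

    Do[
     myBatchFunction[];                        (* placeholder for my Module *)
     Print[{MemoryInUse[], MaxMemoryUsed[]}],  (* watch whether kernel memory climbs *)
     {8}]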

What is going on? Any advice would be most appreciated.

Gregory

POSTED BY: Gregory Lypny
6 Replies
Posted 9 years ago

Thanks once again, Hans,

This is opportune, as I am in the process of breaking my code down into separate functions: one for company info and another for beneficial ownership, for example. I will mix some of your code into mine and see what effect it has on speed.

Gregory

POSTED BY: Gregory Lypny

Gregory:

I was looking at your previous code where you process the company information segment. I believe that you may need to rewrite your code, particularly this segment:

   companyInfoArray = 
     Select[StringTrim /@ 
       StringSplit[
        Select[ReadList[StringToStream[companyInfoRaw], String], 
         StringMatchQ[#, ___ ~~ ":" ~~ ___] && ! 
            StringMatchQ[#, "<" ~~ ___] &], ":"], 
      Length[#] == 2 && #[[2]] =!= "" &];

The stream opened by StringToStream[] is never closed. Assign the stream to a variable (give it a handle) and then Close[] it, something like this:

companyInfoStream = StringToStream[companyInfoRaw];
companyInfoArray = 
  Select[StringTrim /@ 
    StringSplit[
     Select[ReadList[companyInfoStream, String], 
      StringMatchQ[#, ___ ~~ ":" ~~ ___] && ! 
         StringMatchQ[#, "<" ~~ ___] &], ":"], 
   Length[#] == 2 && #[[2]] =!= "" &];
Close[companyInfoStream];
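Alternatively, since companyInfoRaw is already a string in memory, you could skip the stream entirely and split on line breaks. A sketch, assuming plain "\n" line endings:

    companyInfoArray =
      Select[StringTrim /@
         StringSplit[
          Select[StringSplit[companyInfoRaw, "\n"],   (* lines, with no stream to leak *)
           StringMatchQ[#, ___ ~~ ":" ~~ ___] && ! StringMatchQ[#, "<" ~~ ___] &], ":"],
       Length[#] == 2 && #[[2]] =!= "" &];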

Either way, it may help with memory management. I would start there.

Hans

POSTED BY: Hans Michel

Gregory:

You provided some sample code a few months ago (previous post), and I would concur that the companyInfo process, which contains pattern-dependent string functions, may be the source of some issues. From the number of files you are processing, I would say you are starting with about 1,400 CIK codes and processing the SEC header information in each Form DEF 14A to get company location (state) and SIC.

Please note that this header information is also available as an SGML file, at the cost of another URLFetch. For each Form DEF 14A there is a corresponding SGML header file, whose filename is in the content of each Form DEF 14A file. That file is much easier to process. However, doing this for 1,400 CIKs, fetching ~10 Form DEF 14As and then ~10 SGML headers per CIK, would increase the number of web connections and files to save.

The other approach is to process the data in the file, which is what you are doing. You are using the ":" (colon) as a record delimiter. This is what I did in the past as well, but I used simple string splits and did not use ReadList. In addition, I pre-processed the string to the left of the ":" and joined any line with only whitespace to the right of the ":" with the line below it. Here is older code with SGML header processing:

   getBeneficialfromSECDEF14A[cik_]:=Module[
     {formDEF14A,tablestartpos,tablestartnearfunc,tableendpos,tableendnearfunc,benpos,
      tablestartnearest,tableendnearest,tablestartcommon,tableendcommon,bentablestart,
      bentableend,bentable,formLinks,sgmlHeaderFileName,sgmlHeaderpos,sgmlHeaderStubURL,
      sgmlHeaderURL,sgmlHeaderData,sgmlHeaderList,hd,sgmlTextStr,sgmlTextStartpos,
      sgmlTextEndpos,SICFormValue},
    processSECHeader[HeaderData_]:=Module[{HeaderDataStr,HeaderDataStream,HeaderDataList},
     HeaderDataStr=StringReplace[HeaderData,
       {"<SEC-HEADER>"->"","<TYPE>"->"","<PUBLIC-DOCUMENT-COUNT>"->"","<FILER>"->"",
        "<COMPANY-DATA>"->"","</COMPANY-DATA>"->"","<FILING-VALUES>"->"","</FILING-VALUES>"->"",
        "<BUSINESS-ADDRESS>"->"","</BUSINESS-ADDRESS>"->"","<MAIL-ADDRESS>"->"","</MAIL-ADDRESS>"->"",
        "</FILER>"->"","</SEC-HEADER>"->"","<FORM-TYPE>DEF 14A"->"","<ACT>34"->"",
        "<FORMER-COMPANY>"->"","</FORMER-COMPANY>"->"",
        "<ACCEPTANCE-DATETIME>"->"AcceptanceDatetime|","<ACCESSION-NUMBER>"->"AccessionNumber|",
        "<PERIOD>"->"ConformedPeriodOfReport|","<FILING-DATE>"->"FiledAsOfDate|",
        "<DATE-OF-FILING-DATE-CHANGE>"->"DateAsOfChange|","<EFFECTIVENESS-DATE>"->"EffectivenessDate|",
        "<CONFORMED-NAME>"->"CompanyConformedName|","<CIK>"->"CompanyCIK|",
        "<ASSIGNED-SIC>"->"SICNumber|","<IRS-NUMBER>"->"IRSNumber|",
        "<STATE-OF-INCORPORATION>"->"StateOfIncorporation|","<FISCAL-YEAR-END>"->"FiscalYearEnd|",
        "<FILE-NUMBER>"->"FileNumber|","<FILM-NUMBER>"->"FilmNumber|",
        "<STREET1>"->"AddressStreet1|","<STREET2>"->"AddressStreet2|","<CITY>"->"AddressCity|",
        "<STATE>"->"AddressState|","<ZIP>"->"AddressZip|","<PHONE>"->"BusinessPhone|",
        "<FORMER-CONFORMED-NAME>"->"FormerConformedName|","<DATE-CHANGED>"->"DateChanged|"}];
    HeaderDataStream=StringToStream[HeaderDataStr];
    HeaderDataList=ReadList[HeaderDataStream,String];
    Close[HeaderDataStream];
    Return[HeaderDataList];];
    foundToYear[x_]:=Module[{foundstr,lyear},foundstr=StringCases[x,RegularExpression["\\-(\\d\\d)\\-"]->"$1"][[1]];
    lyear=If[ToExpression[foundstr]>49,Plus[1900,ToExpression[foundstr]],Plus[2000,ToExpression[foundstr]]];
    Return[lyear];];
    getSICValue[formdata_]:=Module[{searchstring,SICPos,SICNumPos,SICValue=""},searchstring="STANDARD INDUSTRIAL CLASSIFICATION:";
    SICPos=StringPosition[formdata,searchstring,IgnoreCase->True];
    SICNumPos=StringPosition[formdata,RegularExpression["(STANDARD INDUSTRIAL CLASSIFICATION:)?(\\[\\d{4}\\])"]];
    SICValue=StringTrim[StringTake[formdata,{(Last[Flatten[SICPos]]+1),(First[Flatten[SICNumPos]]-1)}]];
    Return[SICValue];];
    formLinks=DeleteDuplicates[Sort[Select[Import["http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D"<>IntegerString[ToExpression[cik],10,10]<>"+TYPE%3DDEF&first=1994&last="<>DateString[DateList[],"Year"],"Hyperlinks"],Function[StringMatchQ[#,"*.txt"]==True]],Function[foundToYear[#1]>foundToYear[#2]]]];
    formDEF14A=Import[formLinks[[1]],"Plaintext"];
    SICFormValue=getSICValue[formDEF14A];
    tablestartpos=StringPosition[formDEF14A,"<table",IgnoreCase->True];
    tablestartnearfunc=Nearest[tablestartpos];
    tableendpos=StringPosition[formDEF14A,"</table>",IgnoreCase->True];
    tableendnearfunc=Nearest[tableendpos];
    benpos=StringPosition[formDEF14A,"beneficial",IgnoreCase->True];
    tablestartnearest=Flatten[Map[tablestartnearfunc,benpos],1];
    tableendnearest=Flatten[Map[tableendnearfunc,benpos],1];
    tablestartcommon=Commonest[tablestartnearest];
    tableendcommon=Commonest[tableendnearest];
    bentablestart=Min[tablestartcommon];
    bentableend=Min[tableendcommon];
    If[bentableend<bentablestart,(*find other table end*)bentableend=SelectFirst[tableendnearest[[All,2]],Function[Less[bentablestart,#]]];];
    bentable=ImportString[StringTake[formDEF14A,{bentablestart,bentableend}],{"HTML","Data"}];
    sgmlHeaderpos=StringPosition[formDEF14A,{"<SEC-HEADER>",".sgml"},2,Overlaps->False];
    sgmlHeaderFileName=StringTake[formDEF14A,{Last[First[sgmlHeaderpos]]+1,Last[Last[sgmlHeaderpos]]}];
    sgmlHeaderStubURL=StringReplacePart[formLinks[[1]],"",Last[StringPosition[formLinks[[1]],RegularExpression["(/).*([.]txt)"]]]];
    sgmlHeaderURL=sgmlHeaderStubURL<>"/"<>sgmlHeaderFileName;
    sgmlHeaderData=Import[sgmlHeaderURL,"Text"];
    hd=processSECHeader[sgmlHeaderData];
    Return[{formLinks,hd,SICFormValue,bentable}];];

The code above still uses Import, which you may have changed to URLFetch now that you are processing files locally. I will have to dig up the other process or rewrite it.
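For completeness, the fetch-and-persist step could look something like this (the URL and localDir are placeholders, not your actual values):

    url = "http://www.sec.gov/Archives/edgar/data/.../filing.txt";  (* placeholder URL *)
    raw = URLFetch[url];                                   (* page body as a string *)
    Export[FileNameJoin[{localDir, FileNameTake[url]}], raw, "Text"]  (* persist locally *)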

I believe that your original issue was not with Import but with processing the SGML header to get the SIC. Nevertheless, I have outlined a basic solution path for processing the companyInfo header in the Form DEF 14A txt file. It is also not clear from your answer whether you experienced any web issues once the files were persisted locally; did you? I do not wish to steer you wrong.

POSTED BY: Hans Michel

Greg:

If this is the same SEC Form DEF 14A data processing, then I have a few questions. Are you persisting the *.txt files to your file system? If so, would you say URLFetch is working well? Do you then go back and process each file from the local file store using Import[]? And do you save your CSV files in the same directory as your source files (I hope not)?

If these are the SEC index files, those files are zipped (compressed), so any automatic decompression Mathematica has to do is going to be memory intensive.

If it is the former situation and you are OK with URLFetch and persisting to the file system, then I would not go back to Import[] when there are other functions, such as OpenRead[] and Close[], that give you more control over the stream. For HTML document fragments, the automated choice for pulling out tables is ImportString[HTMLFragment, "Data"]; note that the older DEF 14A files (year < 2001) are more text files than HTML. I would also advise that you stop batching: process each file one at a time, use a bit of error tracking to mark any files that fail processing, and return to the errored list afterwards, if there is one (see the sketch below). At 1 second of processing per file, 14,000 files is about 4 hours.
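A sketch of that one-at-a-time loop with error tracking (processFile and sourceDir are placeholders for your per-file extraction and source folder):

    files = FileNames["*.txt", sourceDir];          (* sourceDir is a placeholder *)
    failed = Flatten[Last[Reap[
        Do[
         Quiet@Check[processFile[f], Sow[f]],       (* Sow the file name on failure *)
         {f, files}]]]];
    (* revisit whatever ended up in failed, if anything *)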

I also assumed that you were going to wrap the entire process in a Mathematica package (before you do that, work out the kinks).

POSTED BY: Hans Michel
Posted 9 years ago

Hi Hans,

Yes, I am downloading en masse and then processing locally. I suspect that the slowdown has to do with repeated calls to string functions that depend on patterns. I will play with OpenRead[] and Close[] as you suggest. I will also try to construct a generic code example that illustrates the problem, as requested by Stefan.
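For instance, something along these lines for a single file (the path is a placeholder):

    str = OpenRead["/path/to/one/filing.txt"];  (* placeholder path *)
    lines = ReadList[str, String];              (* whole file as a list of lines *)
    Close[str];                                 (* release the stream explicitly *)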

Thanks again,

Gregory

POSTED BY: Gregory Lypny

Can you share more of your code with us? Without those specifics, it is very hard to diagnose what could be going wrong.
