
How Can I Avoid a Session Timeout When Importing URLs?

Posted 10 years ago

Hello everyone,

I am writing scripts to download data from public websites. I use

Import[URL, "Plaintext"]

or

Import[URL, {"HTML","Data"}]

The scripts work fine as long as I don't make too many calls to the function that is doing the importing. When the import function is housed in a loop, I can make about 1,000 downloads, at which point the loop has been running for half an hour or longer, and then it either stalls or I get an error from Mathematica telling me that I need to check my internet connectivity because the session has timed out. I guess that means that I have overstayed my welcome at the host server.

I could break up my downloads into fewer calls. But is there anything else I can do to avoid the session timeout and grab all the data I need in one swoop?
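For what it is worth, the kind of batching I have in mind looks roughly like this, where urlList stands in for my list of pages and the batch size and pause length are just guesses:

batchSize = 100;
results = {};
Do[
 batch = Take[urlList, {i, Min[i + batchSize - 1, Length[urlList]]}];
 results = Join[results, Import[#, "Plaintext"] & /@ batch];
 Pause[30], (* rest between batches before hitting the server again *)
 {i, 1, Length[urlList], batchSize}]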

Any tips would be much appreciated,

Gregory

POSTED BY: Gregory Lypny
6 Replies
Posted 10 years ago

I believe the following is a related problem:

http://community.wolfram.com/groups/-/m/t/482724?ppauth=f3Gpyp9U

POSTED BY: Sandu Ursu
Posted 10 years ago

Well, I think I know why connections time out: I think Mathematica gets bogged down whenever a function is called repeatedly. Each call to a function takes longer than the previous one. Have a look at my recent discussion Need Help With Speed or Memory Problem, where I described how a function I created to process many files on a local drive slows to a crawl. I have been trying to speed it up by using Block instead of Module, but it has not helped much.
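If it helps to diagnose it, the way I have been checking the slowdown is to time each call, roughly like this (processOneFile and fileList stand in for my actual function and file list):

timings = Table[First[AbsoluteTiming[processOneFile[f]]], {f, fileList}];
ListLinePlot[timings, AxesLabel -> {"call number", "seconds"}]

The timings creep steadily upward rather than staying flat.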

POSTED BY: Gregory Lypny
Posted 10 years ago

Excellent. Will give it a go. I'm also looking at tapping into the master indexes on the SEC FTP site.

POSTED BY: Gregory Lypny

Gregory:

I would also advise that you persist your data in a local data store, be it your file system or a database. I would spend one pass getting all the internet data onto your machine, then another pass processing it. This way you separate data fetching from data processing. Your HTTPClient connections will be shorter, which may reduce the possibility of connection or session timeouts.
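As a rough sketch of that two-pass idea (targetDir and urlList are placeholders you would fill in yourself):

(* Pass 1: fetch every page to disk; no processing yet *)
Do[Export[FileNameJoin[{targetDir, ToString[i] <> ".html"}],
   URLFetch[urlList[[i]], "Cookies" -> False], "Text"],
  {i, Length[urlList]}]

(* Pass 2: process the local copies later, with no network connection open *)
pages = Import[#, "Text"] & /@ FileNames["*.html", targetDir];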

The following code will import the links to Form DEF14A for a particular CIK from 1994 to this year using Import.

$HTTPCookies
cik = 320193;
(* turn the two-digit year embedded in an EDGAR file name, e.g. "-97-", into a four-digit year *)
foundToYear[x_] := Module[{foundstr},
  foundstr = StringCases[x, RegularExpression["\\-(\\d\\d)\\-"] -> "$1"][[1]];
  If[ToExpression[foundstr] > 49, 1900, 2000] + ToExpression[foundstr]];
DeleteDuplicates[Sort[
  Select[
   Import["http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <>
     IntegerString[ToExpression[cik], 10, 10] <> "+TYPE%3DDEF&first=1994&last=" <>
     DateString[DateList[], "Year"], "Hyperlinks"],
   StringMatchQ[#, "*.txt"] &],
  foundToYear[#1] > foundToYear[#2] &]]
$HTTPCookies

The following code will import the links to Form DEF14A for a particular CIK from 1994 to this year using URLFetch.

$HTTPCookies
cik = 320193;
foundToYear[x_] := Module[{foundstr},
  foundstr = StringCases[x, RegularExpression["\\-(\\d\\d)\\-"] -> "$1"][[1]];
  If[ToExpression[foundstr] > 49, 1900, 2000] + ToExpression[foundstr]];
DeleteDuplicates[Sort[
  Select[
   ImportString[
    URLFetch["http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <>
      IntegerString[ToExpression[cik], 10, 10] <> "+TYPE%3DDEF&first=1994&last=" <>
      DateString[DateList[], "Year"], "Cookies" -> False], {"HTML", "Hyperlinks"}],
   StringMatchQ[#, "*.txt"] &],
  foundToYear[#1] > foundToYear[#2] &]]
$HTTPCookies

The $HTTPCookies global variable remains empty in this example. One side effect, however, is that the links returned by this version lose the base URL. You may also want to throw in "StoreCookies" -> False. Also look at the Help tutorial "tutorial/InternetConnectivity".
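For example, something along these lines should re-attach the base URL to the relative links, assuming they are all rooted at sec.gov; links stands for the list returned by the URLFetch version above:

absoluteLinks =
  If[StringMatchQ[#, "http*"], #, "http://www.sec.gov" <> #] & /@ links;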

The aim is to reduce the reuse of the same connection (which in some cases is a good thing) if you decide to process the data in the same run as fetching it. If it takes 90 seconds to process a particular CIK and the connection timeout for the HTTPClient, proxy server, or web server is shorter than 90 seconds, the next internet connection to the same base URL may return an error. You want the HTTPClient and the web server to treat each connection as a new connection.

Hans

POSTED BY: Hans Michel

Gregory:

Look into the following:

$HTTPCookies  

Run the expression

$HTTPCookies

Then, instead of running a loop, run your function once, and run the expression

$HTTPCookies 

again. There should be some data in the result.

I can't recall if the Import command supports control over HTTP header information and cookies. But the

URLFetch

and

URLFetchAsynchronous 

do. In addition, these commands' options give you more control over HTTP header data, including cookies, timeouts, etc. You may need to rewrite the Import commands to be

URLFetch
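Something along these lines, if I remember the option names correctly (url is a placeholder for the page you are fetching):

raw = URLFetch[url, "Cookies" -> False, "StoreCookies" -> False,
   "ConnectTimeout" -> 30, "ReadTimeout" -> 60];
text = ImportString[raw, {"HTML", "Plaintext"}];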

Hans

POSTED BY: Hans Michel
Posted 10 years ago

Hi Hans,

Thanks, as always. I'll look into URLFetch.

Gregory

POSTED BY: Gregory Lypny