
How can I throttle downloads with URLSaveAsynchronous?

Posted 9 years ago

Hello everyone, I need to download thousands of files from the Securities and Exchange Commission's website. Access is through anonymous FTP, with "anonymous" as the username and my email address as the password. I've been using URLSave in a loop, but my script aborts with session-timed-out or cannot-connect-to-server errors after a few dozen, and sometimes a few hundred, downloads. The SEC's webmaster tells me that "There is no load/rate limiting on FTP, but if you are running a fast process, it is possible you are temporarily overwhelming the server." So I think I need to throttle my requests, perhaps by using URLSaveAsynchronous to check the status of the file currently being downloaded and not requesting another file until that download is complete. I also wonder whether I should connect to the FTP site only once with the username and password, loop over my requests, and then close the connection. Again, I don't know how to do this in Mathematica.

Any tips or suggestions would be much appreciated.

Gregory

POSTED BY: Gregory Lypny
4 Replies
Posted 9 years ago

Hi Richard,

Thanks for the detailed tip! It's pretty involved. I'm going to study it and give it a whirl.

Thanks once again,

Gregory

POSTED BY: Gregory Lypny

Something like this should work:

(* Set this to the number of asynchronous downloads you want running at a time *)
tasks = 10;

(* Set these as needed *)
user     = "anonymous";
password = "password";

(* This will be your list of URLs that you need to download. I just used this to test. *)
urls = ConstantArray[ "http://exampledata.wolfram.com/USConstitution.txt", 100 ];

(* Choose a download location *)
SetDirectory @ CreateDirectory[];

(* If you'd like some notification when a download is starting, use something like this:
   alert = Print["downloading: ", #] &; *)

alert = Null &;

elements = { "statuscode", "progress", "error", "headers", "cookies", "data" };
initialStatus = AssociationMap[ {} &, elements ];

store = Append[ #, "status" -> "waiting" ] & /@
  Association[ MapIndexed[
      First @ #2 -> Prepend[ initialStatus, "url" -> #1 ] &,
      urls
  ] ];

callback // ClearAll;
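callback[ async_, "data", data_ ] :=
  Module[ { key },
      (* "UserData" carries the store key that startDownload attached to this task *)
      key = "UserData" /. Options @ async;
      If[ data =!= { {} }
          ,
          store[ key, "data"   ] = data
          ,
          (* This code treats an empty "data" event ({ {} }) as the end of the download *)
          store[ key, "status" ] = "finished";
          Module[ { nextKey },

              (* Pick the next URL that has not been started yet *)
              nextKey = SelectFirst[
                  Keys @ store,
                  store[ #, "status" ] === "waiting" &
              ];

              If[ nextKey =!= Missing @ "NotFound",
                  startDownload @ nextKey
              ]
          ]
      ]
  ];

(* Every other event (status code, progress, error, ...) is recorded under its tag *)
callback[ async_, tag_, contents_ ] :=
  Module[ { key },
      key = "UserData" /. Options @ async;
      store[ key, tag ] = contents;
  ];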

startDownload // ClearAll;
startDownload[ i_ ] := (
    store[ i, "status" ] = "initialized";
    alert @ i;
    With[ { url = store[ i, "url" ] },

        URLSaveAsynchronous[ url,
                             ToString @ i <> "_" <> FileNameTake @ url,
                             callback,
                             "UserData" -> i,
                             "Username" -> user,
                             "Password" -> password
        ]
    ]
);

(* Start the downloads *)
startDownload /@ Range @ tasks;

(* View progress *)
Dynamic @ Counts @ store[[ All, "status" ]]
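
Once the counts stop changing, it may help to list the URLs that never reached "finished" so they can be requested again. The following is a minimal sketch, assuming the store layout used above (an association of associations with "url" and "status" keys); the name unfinished is hypothetical and not part of the code above.

```wolfram
(* Hypothetical helper: return the URLs of all entries whose status is not "finished". *)
(* Assumes each value in s is an association with "url" and "status" keys. *)
unfinished[ s_Association ] :=
  Map[ #url &, Values @ Select[ s, #status =!= "finished" & ] ];

(* Example use after a run (commented out because it depends on the store above):
   unfinished @ store *)
```

Entries still marked "waiting" or "initialized" long after the counts settle are likely candidates for a retry.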
POSTED BY: Richard Hennigan

Have you tried just inserting Pause statements?

Typically, I check whether each download succeeded and, if it hasn't, I run Pause for a short while. This is usually enough to prevent problems like this.
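
The pause-on-failure idea could be sketched roughly as follows. This is only an illustration under stated assumptions: retryWithPause and saveOnce are hypothetical names, the retry count and wait time are arbitrary, and any message issued by URLSave is treated as a failed download.

```wolfram
(* Hypothetical helper: call action[] until it returns something other than $Failed, *)
(* pausing `wait` seconds after each failed attempt, for at most maxTries attempts.  *)
retryWithPause[ action_, maxTries_Integer?Positive, wait_?NonNegative ] :=
  Module[ { result = $Failed, try = 0 },
      While[ try < maxTries && result === $Failed,
          try++;
          result = action[ ];
          If[ result === $Failed && try < maxTries, Pause @ wait ]
      ];
      result
  ];

(* Hypothetical wrapper: Check converts any message issued by URLSave into $Failed. *)
(* Extra options such as "Username" and "Password" are passed through unchanged.    *)
saveOnce[ url_String, file_String, opts___ ] :=
  Quiet @ Check[ URLSave[ url, file, opts ], $Failed ];

(* Example use (commented out; urls is assumed to be your own list of URLs):
   Do[
       retryWithPause[ saveOnce[ url, FileNameTake @ url ] &, 3, 10 ],
       { url, urls }
   ] *)
```

Separating the retry loop from the download call keeps the pause logic reusable, and it can be tried out with a dummy action before pointing it at a real server.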

POSTED BY: Sean Clarke
Posted 9 years ago

Hi Sean, thank you for responding. I'm not sure how to check whether a download succeeds. I could rig something with FileExistsQ, but I suspect there is a slicker way by checking the status of a URL function such as URLSaveAsynchronous with its Progress option.
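
For reference, the FileExistsQ approach mentioned here might look something like the sketch below. The names downloadedQ and missingDownloads are hypothetical, and the check assumes that a file which exists and is non-empty counts as a successful download, which will not catch truncated transfers.

```wolfram
(* Hypothetical helper: a download counts as successful if the file exists and is non-empty. *)
downloadedQ[ file_String ] := FileExistsQ @ file && FileByteCount @ file > 0;

(* Hypothetical helper: the URLs whose target files in dir are missing or empty. *)
(* Assumes each file was saved under the last component of its URL.             *)
missingDownloads[ urls_List, dir_String ] :=
  Select[ urls, ! downloadedQ @ FileNameJoin @ { dir, FileNameTake @ # } & ];

(* Example use (commented out; urls is assumed to be your own list of URLs):
   missingDownloads[ urls, Directory[ ] ] *)
```

The list returned by missingDownloads can be fed back into the download loop until it comes back empty.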

POSTED BY: Gregory Lypny