
Asynchronous Import (MultipleDownload function)

Suppose you want to download multiple files from the internet, perhaps data from a dictionary or something similar. Using multiple Import calls would be extremely slow, since the downloads would be synchronous (one at a time).

Instead, you can use the Mathematica function URLSubmit to make asynchronous requests (several at a time). The only "drawback" is that you need to wait for the requests to finish (you can still run other parts of your code while the downloads are in progress).
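
To see what a single asynchronous request looks like, here is a minimal sketch using the same HandlerFunctions and HandlerFunctionsKeys options as the function below (the URL is one of the examples used later). URLSubmit returns immediately, and the handler only fires once the response body has arrived:

(* submit one asynchronous request; evaluation returns at once,
   and the handler prints a message when the body has been received *)
URLSubmit["http://www.dictionary.com/browse/pig",
   HandlerFunctionsKeys -> {"Body"},
   HandlerFunctions -> <|"BodyReceived" -> (Print["body received for pig"] &)|>]

Because each handler fires at an unpredictable time, the function below has to poll until all of the results have arrived.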

With the simple code below, you can download multiple files of a given file type and parse the results, returning a list of data.

Options@MultipleDownload = {TimeConstraint -> 30};
MultipleDownload[urls:{__String}, fileType_String, parseFun_, OptionsPattern[]] := Module[{L = {}},
    (* submit one asynchronous request per url; each handler parses the
       body as soon as it arrives and appends the result to L *)
    Function[{url}, URLSubmit[url, HandlerFunctionsKeys -> {"Body"},
       HandlerFunctions -> <|"BodyReceived" -> (AppendTo[L, parseFun@ImportString[First@#1["Body"], fileType]] &)|>]
    ] /@ urls;
    (* poll until every result has arrived, or give up after TimeConstraint seconds *)
    TimeConstrained[While[Length@L < Length@urls,
       Pause[1];
       PrintTemporary[{Length@L, Length@urls}]
    ], OptionValue@TimeConstraint];

    L
]

It basically makes a call to URLSubmit for each URL, fetches the data, parses it and appends the result to a temporary list L, then waits until the list is full or the time limit runs out. Note that results are appended in the order the responses arrive, not in the order of the input urls (as the output below shows). It does no error handling, so use with care.

Real-world usage would look like this (getting the pronunciation respelling from dictionary.com):

words = {"pig", "people", "math", "car", "book"};
urls = StringTemplate["http://www.dictionary.com/browse/`1`"] /@ words;
data = MultipleDownload[urls, "XMLObject",
    Function[{xml},
      First@Cases[xml, a:XMLElement["span", {"class" -> "pron spellpron"}, {L__}] :>
      StringTrim@ImportString[ExportString[a, "XML"], "HTML"], \[Infinity]]
    ]
]
(* Output *)
{"[kahr]", "[pig]", "[ pee -p uh l]", "[math]", "[b oo k]"}

PS: The function posted is intended only for academic or personal use. No content from the original website should ever be used for monetary gain or redistributed.

POSTED BY: Thales Fernandes

I've used URLSaveAsynchronous to do something similar, then created a rudimentary dashboard of active jobs with the code below.

Dynamic[Refresh[AsynchronousTasks[] // Column, UpdateInterval -> 1]]
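
For example, something along these lines (the file names, and the reuse of the words and urls lists from the post above, are just for illustration): each URLSaveAsynchronous call writes its response to a file, and the task stays visible in AsynchronousTasks[] while it is running.

(* save each page asynchronously to a temporary file *)
files = FileNameJoin[{$TemporaryDirectory, # <> ".html"}] & /@ words;
tasks = MapThread[URLSaveAsynchronous, {urls, files}];

(* rudimentary dashboard of the active jobs *)
Dynamic[Refresh[AsynchronousTasks[] // Column, UpdateInterval -> 1]]

(* block until every download has finished, then import the saved files *)
WaitAsynchronousTask /@ tasks;
data = Import[#, "XMLObject"] & /@ files;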

POSTED BY: Vincent Virgilio

> I've used URLSaveAsynchronous to do something similar, then created a rudimentary dashboard of active jobs with the code below.

My first go at this "problem" was to use it too. But I needed to download hundreds or even thousands of HTML pages only to extract minimal data from each. Saving and then parsing every single file was a waste of space, so I decided to just parse the data in memory and save it later.

POSTED BY: Thales Fernandes