Speed up asynchronous web crawler?

Posted 8 years ago

I made a simple web crawler, and I'm trying to speed it up by having it work on 10 asynchronous requests at a time. As a test I'm using the caltech.edu domain and just saving the HTML for each page in memory. It works fine for several hundred pages and then the kernel crashes without warning. The responsiveness of the notebook is also pretty low while it is running. The synchronous version works fine but is slower (a rough sketch of it is included at the end of this post). I was wondering if anyone has any tips or guidance; for now I'm being forced to switch to and learn Python's Scrapy. The id variable below is the school's ID in the IPEDS government data files on colleges.

<< JLink`
InstallJava[];
(* resolve a possibly relative link against a base URL via java.net.URL *)
resolveURL[base_, url_] := JavaBlock[
  JavaNew["java.net.URL", JavaNew["java.net.URL", base], url]@toString[]
  ]

id = 110404;
homepage = "http://www.caltech.edu";
currentDomain = URLParse[homepage, "Domain"];
toVisit = {homepage};
visited = {};
results = {};
maxRequests = 10;
currentRequests = 0;

While[Length@toVisit > 0 || currentRequests > 0,
 If[currentRequests < maxRequests && Length@toVisit > 0,
  (
   current = First@toVisit;
   AppendTo[visited, current];
   toVisit = Rest@toVisit;
   currentRequests++;
   Module[{current = current, links, html},
    URLSubmit[current,
     HandlerFunctionsKeys -> {"StatusCode", "Body"},
     HandlerFunctions -> <|"TaskFinished" -> ((
          currentRequests--;
          If[#StatusCode == 200,
           html = First@#Body;
           links = ImportString[html, {"HTML", "Hyperlinks"}] //
              Map[Quiet@resolveURL[current, #] &] // Cases[_String];
           toVisit = Join[
               toVisit,
               links //
                 Map@URLBuild (* clean up trailing /'s *) //
                Select[URLParse[#, "Domain"] == currentDomain &&
                   StringStartsQ[#, "http"] &]
               ] // Union // Complement[#, visited] &;
           AppendTo[results, {id, current, html}]
           ]) &)|>]
    ]
   ),
  Pause@.1
  ]
 ];

To watch it while it runs, I use the following:

Dynamic@currentRequests

Dynamic@Length@toVisit

Dynamic@Length@visited

Dynamic@Length@results

Dynamic[Column@Take[toVisit, UpTo@10]]

Dynamic[Column@Take[Reverse@visited, UpTo@10]]
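
For reference, the synchronous version mentioned above is roughly the following (a sketch rather than the exact code; it just fetches one page at a time with URLRead and reuses the same link handling):

(* synchronous sketch: blocks on each request, so it is slower but stays stable *)
While[Length@toVisit > 0,
 current = First@toVisit;
 AppendTo[visited, current];
 toVisit = Rest@toVisit;
 response = URLRead[current]; (* blocks until the page is fetched *)
 If[response["StatusCode"] == 200,
  html = response["Body"];
  links = ImportString[html, {"HTML", "Hyperlinks"}] //
     Map[Quiet@resolveURL[current, #] &] // Cases[_String];
  toVisit = Join[
      toVisit,
      links // Map@URLBuild //
       Select[URLParse[#, "Domain"] == currentDomain &&
          StringStartsQ[#, "http"] &]
      ] // Union // Complement[#, visited] &;
  AppendTo[results, {id, current, html}]
  ]
 ]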
POSTED BY: Michael Hale
2 Replies

Could not run it on version 11. Is toString defined? It stays blue...

POSTED BY: Sam Carrettie
Posted 8 years ago

It works for me if I copy and paste into a new v11 session on Windows 10, although I have to abort the computation and evaluate it again because for some reason the first URLSubmit never returns on the first evaluation.

That toString should stay blue. It is using JLink, which turns the otherwise undefined symbol into a call to the Java toString() method on a Java URL object. The Java URL class handles resolving relative URLs on web pages (for example, if I'm on http://www.google.com/abc/def.html and it contains a link to ../search.html, that should resolve to http://www.google.com/search.html). I couldn't find a built-in way to do this in Mathematica (although relative URLs do get resolved if you do Import[url, "Hyperlinks"]), so I used Java's. It's a pretty simple function to write yourself, but I was sure I would mishandle trailing slashes or something; a rough pure-Wolfram sketch is shown after the test line below. You can test it with

resolveURL["http://www.google.com/abc/def.html", "../search.html"]
POSTED BY: Michael Hale