Speed up asynchronous web crawler?

Posted 8 years ago

I made a simple web crawler, and I'm trying to speed it up by having it work on 10 asynchronous requests at a time. As a test I'm using the caltech.edu domain and just saving the HTML for each page in memory. It works fine for several hundred pages and then the kernel crashes without warning. The responsiveness of the notebook is also pretty low while it is running. The synchronous version works fine but is slower (a rough sketch of it is included at the end of this post). I was wondering if anyone has any tips or guidance; for now I'm being forced to switch to and learn Python's Scrapy. The id variable below is the school's ID in the IPEDS government data files on colleges.

<< JLink`
InstallJava[];
(* resolve a possibly relative link against a base URL via java.net.URL *)
resolveURL[base_, url_] := JavaBlock[
  JavaNew["java.net.URL", JavaNew["java.net.URL", base], url]@toString[]
  ]

id = 110404;
homepage = "http://www.caltech.edu";
currentDomain = URLParse[homepage, "Domain"];
toVisit = {homepage};
visited = {};
results = {};
maxRequests = 10;
currentRequests = 0;

While[Length@toVisit > 0 || currentRequests > 0,
 If[currentRequests < maxRequests && Length@toVisit > 0,
  (
   current = First@toVisit;
   AppendTo[visited, current];
   toVisit = Rest@toVisit;
   currentRequests++;
   Module[{current = current, links, html},
    URLSubmit[current,
     HandlerFunctionsKeys -> {"StatusCode", "Body"},
     HandlerFunctions -> <|"TaskFinished" -> ((
          currentRequests--;
          If[#StatusCode == 200,
           html = First@#Body;
           links = ImportString[html, {"HTML", "Hyperlinks"}] //
              Map[Quiet@resolveURL[current, #] &] // Cases[_String];
           toVisit = Join[
               toVisit,
               links //
                 Map@URLBuild (* clean up trailing /'s *) //
                Select[URLParse[#, "Domain"] == currentDomain &&
                   StringStartsQ[#, "http"] &]
               ] // Union // Complement[#, visited] &;
           AppendTo[results, {id, current, html}]
           ]) &)|>]
    ]
   ),
  Pause@.1
  ]
 ];

To watch it while it runs, I use the following:

Dynamic@currentRequests

Dynamic@Length@toVisit

Dynamic@Length@visited

Dynamic@Length@results

Dynamic[Column@Take[toVisit, UpTo@10]]

Dynamic[Column@Take[Reverse@visited, UpTo@10]]
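
For reference, the synchronous version mentioned above is roughly the following (a sketch rather than the exact code; it just fetches one page at a time with URLRead and reuses the same link handling):

(* synchronous sketch: blocks on each request, so it is slower but stays stable *)
While[Length@toVisit > 0,
 current = First@toVisit;
 AppendTo[visited, current];
 toVisit = Rest@toVisit;
 response = URLRead[current]; (* blocks until the page is fetched *)
 If[response["StatusCode"] == 200,
  html = response["Body"];
  links = ImportString[html, {"HTML", "Hyperlinks"}] //
     Map[Quiet@resolveURL[current, #] &] // Cases[_String];
  toVisit = Join[
      toVisit,
      links // Map@URLBuild //
       Select[URLParse[#, "Domain"] == currentDomain &&
          StringStartsQ[#, "http"] &]
      ] // Union // Complement[#, visited] &;
  AppendTo[results, {id, current, html}]
  ]
 ]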
POSTED BY: Michael Hale
2 Replies

Could not run it on version 11. Is toString defined? It stays blue...

POSTED BY: Sam Carrettie
Posted 8 years ago

It works for me if I copy and paste into a new v11 session on Windows 10, although I have to abort the computation and evaluate it again because for some reason the first URLSubmit never returns on the first evaluation.

That toString should stay blue. It is using JLink, which turns the otherwise undefined symbol into a call to the Java toString() method on a Java URL object. The Java URL class handles resolving relative URLs on web pages (for example, if I'm on http://www.google.com/abc/def.html and it contains a link to ../search.html, that should resolve to http://www.google.com/search.html). I couldn't find a built-in way to do this in Mathematica (although relative URLs do get resolved if you do Import[url, "Hyperlinks"]), so I used Java's. It's a pretty simple function to write yourself, but I was sure I would mishandle trailing slashes or something; a rough pure-Wolfram sketch is shown after the test line below. You can test it with

resolveURL["http://www.google.com/abc/def.html", "../search.html"]
POSTED BY: Michael Hale