I made a simple web crawler, and I'm trying to speed it up by having it work on 10 asynchronous requests at a time. As a test I'm crawling the caltech.edu domain and just saving the HTML of each page in memory. The asynchronous version works fine for several hundred pages and then the kernel crashes without warning; the notebook is also fairly unresponsive while it runs. The synchronous version works fine, just more slowly. I was wondering if anyone has any tips or guidance; for now I'm being forced to switch to (and learn) Python's Scrapy. (The id below is the school's ID in the IPEDS government data files on colleges.)
<< JLink`
InstallJava[];
(* resolve a possibly relative link against a base URL via java.net.URL *)
resolveURL[base_, url_] := JavaBlock[
  JavaNew["java.net.URL", JavaNew["java.net.URL", base], url]@toString[]
]
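For example (outputs illustrative, from how java.net.URL resolves references):

resolveURL["http://www.caltech.edu/about/", "people.html"]
(* "http://www.caltech.edu/about/people.html" *)

resolveURL["http://www.caltech.edu/", "http://example.org/x"]
(* absolute links pass through unchanged: "http://example.org/x" *)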
id = 110404;  (* IPEDS school ID *)
homepage = "http://www.caltech.edu";
currentDomain = URLParse[homepage, "Domain"];
toVisit = {homepage};  (* frontier of URLs still to fetch *)
visited = {};          (* URLs already dequeued *)
results = {};          (* {id, url, html} triples *)
maxRequests = 10;      (* concurrency limit *)
currentRequests = 0;   (* requests currently in flight *)
While[Length@toVisit > 0 || currentRequests > 0,
  If[currentRequests < maxRequests && Length@toVisit > 0,
    (
      (* dequeue the next URL and start an asynchronous request for it *)
      current = First@toVisit;
      AppendTo[visited, current];
      toVisit = Rest@toVisit;
      currentRequests++;
      Module[{current = current, links, html},
        URLSubmit[current,
          HandlerFunctionsKeys -> {"StatusCode", "Body"},
          HandlerFunctions -> <|"TaskFinished" -> ((
            currentRequests--;
            If[#StatusCode == 200,
              html = First@#Body;
              (* extract links and resolve them against the current page *)
              links = ImportString[html, {"HTML", "Hyperlinks"}] //
                Map[Quiet@resolveURL[current, #] &] // Cases[_String];
              (* keep same-domain http(s) links we haven't seen yet *)
              toVisit = Join[
                  toVisit,
                  links //
                    Map@URLBuild (* clean up trailing /'s *) //
                    Select[URLParse[#, "Domain"] == currentDomain &&
                      StringStartsQ[#, "http"] &]
                ] // Union // Complement[#, visited] &;
              AppendTo[results, {id, current, html}]
            ]) &)|>]
      ]
    ),
    Pause[0.1]  (* at the concurrency limit; wait for a task to finish *)
  ]
];
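For what it's worth, every TaskFinished callback mutates the shared variables currentRequests, toVisit, visited, and results, so my best guess is that two callbacks interleave while updating them. An untested sketch of how those updates could be serialized, assuming PreemptProtect blocks other preemptive evaluations while the callback runs (guardedFinished is just a hypothetical name; current would still be the Module-local URL as above):

guardedFinished = ((
   (* run the whole callback atomically so two finished tasks
      cannot interleave their updates to the shared lists *)
   PreemptProtect[
     currentRequests--;
     If[#StatusCode == 200,
       (* ... same link extraction and toVisit update as above ... *)
       AppendTo[results, {id, current, First@#Body}]
     ]
   ]) &);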
To watch it while it runs, I use the following:
Dynamic@currentRequests
Dynamic@Length@toVisit
Dynamic@Length@visited
Dynamic@Length@results
Dynamic[Column@Take[toVisit, UpTo@10]]
Dynamic[Column@Take[Reverse@visited, UpTo@10]]
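For comparison, the synchronous version I mentioned is essentially the same crawl logic with URLRead in place of URLSubmit (simplified sketch):

(* simplified sketch of the synchronous crawl loop *)
While[Length@toVisit > 0,
  current = First@toVisit;
  AppendTo[visited, current];
  toVisit = Rest@toVisit;
  response = URLRead[current];
  If[response["StatusCode"] == 200,
    html = response["Body"];
    links = ImportString[html, {"HTML", "Hyperlinks"}] //
      Map[Quiet@resolveURL[current, #] &] // Cases[_String];
    toVisit = Complement[
      Union@Join[toVisit,
        Select[links,
          URLParse[#, "Domain"] == currentDomain &&
            StringStartsQ[#, "http"] &]],
      visited];
    AppendTo[results, {id, current, html}]
  ]
];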