Hi Young,
Separating data gathering from data processing is a good practice to follow. Gather the data, save it, then feed it to your processing code. That way if you discover an issue with the processing code or want to change the processing algorithm you do not have to re-download the data.
If cleanproxy is a List of URLs, then
importURL[url_] := Select[Import[url, "Hyperlinks"], StringContainsQ["https://www.sec.gov/"]]
data = AssociationMap[importURL, cleanproxy]
The result is an Association from each URL to the list of filtered hyperlinks. Try it on a small sample first:
data = AssociationMap[importURL, cleanproxy[[1 ;; 3]]]
You may want to save the data at this point. Take a look at DumpSave
in the documentation.
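For instance, a sketch of saving and later restoring the downloaded data (the file name here is just an illustration):

```mathematica
DumpSave["secLinks.mx", data];  (* writes the definition of data to a binary .mx file *)

(* in a later session: *)
Get["secLinks.mx"]  (* restores the symbol data without re-downloading *)
```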
Then write a function containing the processing code. The function takes two arguments, the URL and the list of matching hyperlinks. Do whatever you need to do in the function. The last statement in the function should evaluate to the result.
process[url_, links_] := Module[{},
  (* do something with url and links; *)
  (* the last expression is the returned result *)
  links
]
Then
KeyValueMap[process, data]
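As a purely hypothetical example, if process only needed to count the matching links for each page, it could be

```mathematica
process[url_, links_] := url -> Length[links]

KeyValueMap[process, data]
(* a list of rules: {url1 -> count1, url2 -> count2, ...} *)
```

KeyValueMap calls process once per key/value pair, so whatever process returns is collected into a list in the same order as the keys of data.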