How to run a loop function for a list of hyperlinks

Posted 4 years ago

Hello, I have written code that produces the output I need for one hyperlink, and I would like to run it for 500 hyperlinks. Each hyperlink points to a 10-K document in text (.htm) format. My code imports the hyperlink and analyzes a few things. I have assembled a list of 500 hyperlinks for Wolfram to import, but I am not sure how to set up a loop so that Wolfram runs my code once for each of the 500 hyperlinks.

So far, this is what I have:

importfile = Drop[Import["importfile.xlsx"][[1, All, 4]], 1];
cleanproxy = Flatten@importfile;
Cases[Import[cleanproxy[[1]], "Hyperlinks"], StringContainsQ[""]]

My main code sits below these three lines, but I am stuck on how to get WM to run it for each of the 500 hyperlinks that I feed in. The main code is currently over 30 lines, and I am not sure whether all of it can fit inside a single Do. I presume that I need to set up something like f[x_] :=.... but I am stuck here.

Any help would truly be appreciated! Thank you in advance for your help!

POSTED BY: Young Il Baik
Posted 4 years ago

Hi Young,

Separating data gathering from data processing is a good practice to follow. Gather the data, save it, then feed it to your processing code. That way if you discover an issue with the processing code or want to change the processing algorithm you do not have to re-download the data.

If cleanproxy is a List of URLs, then

importURL[url_] := Select[Import[url, "Hyperlinks"], StringContainsQ[""]]

data = AssociationMap[importURL, cleanproxy]

The result is an Association from each URL to a list of the filtered hyperlinks. Try it on a small sample first:

data = AssociationMap[importURL, cleanproxy[[1;;3]]]

You may want to save the data at this point. Take a look at DumpSave in the documentation.
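
A minimal sketch of the save-and-reload step, assuming data already holds the Association built above. The example.com URLs and the file name data.mx are placeholders, not taken from the thread.

```wolfram
(* Stand-in for the real result of AssociationMap; the URLs are hypothetical *)
data = <|
   "https://example.com/a.htm" -> {"https://example.com/a1.htm"},
   "https://example.com/b.htm" -> {}
   |>;

(* Write the definition of data to a binary .mx file in the temporary directory *)
file = FileNameJoin[{$TemporaryDirectory, "data.mx"}];
DumpSave[file, data];

(* Later, or in a new session: clear the symbol and restore it from the file *)
Clear[data];
Get[file];
Keys[data]
```

After Get, data is defined again without any re-download, so the processing code can be rerun as often as needed.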

Then write a function containing the processing code. The function takes two arguments, the URL and the list of matching hyperlinks. Do whatever you need to do in the function. The last statement in the function should evaluate to the result.

process[url_, links_] := (* do something *)


KeyValueMap[process, data]
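
As a rough illustration of how a longer, multi-line analysis might be wrapped in one function, the sketch below uses Module. The sample URLs and the summary keys ("URL", "LinkCount", "HtmLinks") are hypothetical and only stand in for whatever the real processing computes.

```wolfram
(* Hypothetical sample in the same shape as data: URL -> list of hyperlinks *)
sample = <|
   "https://example.com/a.htm" -> {"https://example.com/x.htm", "https://example.com/y.pdf"},
   "https://example.com/b.htm" -> {}
   |>;

(* Module keeps intermediate variables local; statements are separated by
   semicolons and the value of the last one is returned *)
process[url_, links_] := Module[{htm, n},
   htm = Select[links, StringEndsQ[".htm"]];
   n = Length[links];
   <|"URL" -> url, "LinkCount" -> n, "HtmLinks" -> htm|>
   ];

(* KeyValueMap calls process[url, links] once per entry and returns a list *)
results = KeyValueMap[process, sample]
```

Any number of statements can go inside the Module body, so 30 lines of analysis fit in one definition without a Do loop.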
POSTED BY: Rohit Namjoshi
Posted 4 years ago

Thank you, Rohit for your reply and I truly appreciate it! I have a few follow-up questions if you wouldn't mind.

I see that your function "data = AssociationMap[importer, cleanproxy[[1;;3]]]" maps each URL to the importer of that URL. That is, it seems that "data" now contains URL -> Importer[URL]. However, when I evaluate "data" alone, I was expecting to see text, because each URL eventually takes me to a web page with text. WM did not import the URLs as text.

For example, when I write: process[url, links] := (y2019 = Length[Position[data, "2019"]];) where I don't touch "url" or "links", the function after := doesn't seem to apply to the text of each hyperlink but to the output for "data", which is URL -> importer[URL].

I hope I am making sense, but I just wanted to find a way for WM to click on each URL and import the text of each URL and run the functions inside the process.

Please let me know if I am unclear on any of my questions. Thank you once again for your help! Below is the code I have:

importfile = Drop[Import["proxy2020sample.xlsx"][[1, All, 4]], 1];
cleanproxy = Flatten@importfile;

Cases[Import[cleanproxy[[1]], "Hyperlinks"], StringContainsQ[""]]
importURL[url_] := 
  Select[Import[url, "Hyperlinks"], StringContainsQ[""]]
data = AssociationMap[importer, cleanproxy];
data = AssociationMap[importer, cleanproxy[[1 ;; 3]]];

process[Hyperlinks_, importer_] := (data1 = Import[data, "Plaintext"]);

KeyValueMap[process, data]
POSTED BY: Young Il Baik
Posted 4 years ago

Hi Young,

If the data you want to process is the plaintext then import it during data gathering rather than data processing.

(* I noticed that some sites link to http: not https: so I added testing for either *)
importURL[url_] := Select[Import[url, "Hyperlinks"], 
  StringContainsQ["http" | "https" ~~ "://"]] //
   Map[Import[#, "Plaintext"] &];


data = AssociationMap[importURL, cleanproxy];

I am not sure what you intend to do with the plaintext, but here are a couple of simple examples:

countWords[text_] := TextWords[#] & /* Length /@ text
Map[countWords, data]

wordCloud[text_] := WordCloud[#] & /@ text
Map[wordCloud, data]

Make sure your processing works correctly with a small sample first.
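
For instance, the year count attempted earlier in the thread could be checked on a hand-made sample before running all 500 URLs. This is only a sketch: countYear and the sample text are hypothetical, and it assumes data maps each URL to a list of plaintext strings, as importURL above produces.

```wolfram
(* Hypothetical stand-in for data: URL -> list of plaintext strings *)
sample = <|
   "https://example.com/a.htm" -> {"Fiscal 2019 and 2019 outlook", "No year here"},
   "https://example.com/b.htm" -> {"Results for 2018"}
   |>;

(* Total number of times year occurs across all texts for one URL *)
countYear[year_String][texts_List] := Total[StringCount[texts, year]];

(* Map over an Association applies the function to each value and keeps the keys *)
counts = Map[countYear["2019"], sample]
```

Once the counts look right on the sample, the same call with data in place of sample should cover the full list.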

POSTED BY: Rohit Namjoshi
Posted 4 years ago

Almost identical question posted here.

POSTED BY: Rohit Namjoshi
