How to run a loop function for a list of hyperlinks

Posted 5 years ago
POSTED BY: Young Il Baik
4 Replies
Posted 5 years ago

Almost identical question posted here.

POSTED BY: Rohit Namjoshi
Posted 5 years ago

Thank you, Rohit, for your reply; I truly appreciate it! I have a few follow-up questions if you wouldn't mind.

I see that "data = AssociationMap[importURL, cleanproxy[[1 ;; 3]]]" maps each URL to the result of applying importURL to it. That is, "data" is now an association of the form URL -> importURL[URL]. However, when I evaluate "data" on its own, I expected to see text, because each URL eventually leads to a page of text; Mathematica did not import the URLs as plain text.

For example, when I write process[url_, links_] := (y2019 = Length[Position[data, "2019"]];), where I never use "url" or "links", the body after := does not seem to apply to the text of each hyperlink but to the value of "data", which is URL -> importURL[URL].

I hope I am making sense. I just want Mathematica to follow each URL, import the text of each page, and run the functions inside process.

Please let me know if any of my questions are unclear. Thank you once again for your help! Below is the code I have:

importfile = Drop[Import["proxy2020sample.xlsx"][[1, All, 4]], 1];
cleanproxy = Flatten@importfile;



(* Cases treats StringContainsQ[...] as a pattern and matches nothing; Select applies it as a test *)
Select[Import[cleanproxy[[1]], "Hyperlinks"], 
  StringContainsQ["https://www.sec.gov/"]];
importURL[url_] := 
  Select[Import[url, "Hyperlinks"], 
   StringContainsQ["https://www.sec.gov/"]];
data = AssociationMap[importURL, cleanproxy];            (* full run *)
data = AssociationMap[importURL, cleanproxy[[1 ;; 3]]];  (* small sample *)


process[Hyperlinks_, importer_] := (data1 = Import[data, "Plaintext"]);

KeyValueMap[process, data]
POSTED BY: Young Il Baik
Posted 5 years ago

Hi Young,

If the data you want to process is the plaintext, then import it during data gathering rather than during data processing.

(* I noticed that some sites link to http: not https: so I added testing for either *)
importURL[url_] := Select[Import[url, "Hyperlinks"], 
  StringContainsQ["http" | "https" ~~ "://www.sec.gov/"]] //
   Map[Import[#, "Plaintext"] &];

Then

data = AssociationMap[importURL, cleanproxy];
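
To spot-check the structure before processing everything, a small-sample run might look like this (a sketch; "sample" is just a name I chose):

```
sample = AssociationMap[importURL, cleanproxy[[1 ;; 3]]];
Keys[sample]          (* the three source URLs *)
Map[Length, sample]   (* number of imported plaintext strings per URL *)
```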

Not sure what you intend to do with the plaintext; here are a couple of simple examples:

countWords[text_] := Length[TextWords[#]] & /@ text
Map[countWords, data]

wordCloud[text_] := WordCloud[#] & /@ text
Map[wordCloud, data]
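
Since your follow-up mentioned counting occurrences of "2019", a StringCount-based sketch (countYear is a made-up name; data is the association of plaintext lists from above):

```
countYear[year_][texts_] := Total[StringCount[#, year] & /@ texts]
Map[countYear["2019"], data]
```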

Make sure your processing works correctly with a small sample first.

POSTED BY: Rohit Namjoshi
Posted 5 years ago

Hi Young,

Separating data gathering from data processing is a good practice to follow. Gather the data, save it, then feed it to your processing code. That way if you discover an issue with the processing code or want to change the processing algorithm you do not have to re-download the data.

If cleanproxy is a List of URLs, then

importURL[url_] := Select[Import[url, "Hyperlinks"], StringContainsQ["https://www.sec.gov/"]]

data = AssociationMap[importURL, cleanproxy]

The result is an Association from each URL to a list of the filtered hyperlinks. Try it on a small sample first:

data = AssociationMap[importURL, cleanproxy[[1 ;; 3]]]

You may want to save the data at this point. Take a look at DumpSave in the documentation.
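
For example (the file name here is an assumption):

```
DumpSave["secdata.mx", data];   (* writes the definition of data to a .mx file *)
(* later, in a fresh session *)
Get["secdata.mx"]               (* restores data *)
```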

Then write a function containing the processing code. The function takes two arguments, the URL and the list of matching hyperlinks. Do whatever you need to do in the function. The last statement in the function should evaluate to the result.

process[url_, links_] := (* do something *)
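
For instance, a minimal hypothetical version that just counts the matching links for each URL:

```
process[url_, links_] := Length[links]
```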

Then

KeyValueMap[process, data]
POSTED BY: Rohit Namjoshi