Storing output into a variable + looping function

Posted 3 years ago

Hello, I have a CSV (or Excel) file that looks as below: [image: CSV file with a column named "path"]

I would like to first import the first cell with the hyperlink (in blue), extract everything in plaintext format (just the content), and put the output into the column next to the "path" column.

I have the following code so far (the first line imports the column with the hyperlinks):

data1 = dataset[1, All, 6];
data2 = data1[3]; (* this line refers to the third row *)
data3 = StringSplit[Import[data2, "Plaintext"], ","];
data4 = StringReplace[#, (StartOfString ~~ ",") | ("," ~~ EndOfString) :> ""] & /@ data3;
data5 = StringReplace[#, (StartOfString ~~ Whitespace) | (Whitespace ~~ EndOfString) :> ""] & /@ data4;
data6 = StringSplit[ToString[data5], " "];
data7 = StringSplit[data6, ".htm "];
I have the output:

{{"Description"}, {"Document"}, {"Type"}, {"Size"}, {"1"}, {"PROXY"}, {"2010"}, {"proxy2010.htm"}, {"DEF"}, {"14A"}, {"717341"}}

I would like to take the part that says "xxx.htm", where xxx changes for every row but the string always ends in ".htm".

My questions are:

**From the output above, can I take the ".htm" part of the output and store that address in a variable? And can I run this entire process over rows 2 through 100 (a loop)? Thank you!**
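In Wolfram Language terms, what I am after is something like the sketch below: wrap the steps above in a function, pull out the entry ending in ".htm", and Map the function over the rows instead of writing an explicit loop. The function name and the indexing of `dataset` are just placeholders here, since I am not sure of the exact syntax:

```
(* Sketch only: getHtmName is a placeholder name; `dataset` is the
   imported file from above, with hyperlinks assumed in column 6. *)
getHtmName[url_] :=
  SelectFirst[StringSplit[Import[url, "Plaintext"]],
    StringEndsQ[#, ".htm"] &]

(* Apply the whole pipeline to rows 2 through 100: *)
htmNames = getHtmName /@ Normal[dataset[2 ;; 100, 6]]
```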

POSTED BY: Young Il Baik
13 Replies

Welcome to Wolfram Community!
Please make sure you know the rules: https://wolfr.am/READ-1ST
Images don't help other members answer your question. They need to be able to copy the elements involved in your question, such as code and data.
If you don't want to share your data file, you can create a sample file that contains part of your data and attach it, otherwise we won't be able to help you.

Also, next time please mention that this post is a continuation of an earlier question that already received some answers.

POSTED BY: Moderation Team
Posted 3 years ago

Please attach the CSV file (or a few rows from it) to your post.

POSTED BY: Rohit Namjoshi
Posted 3 years ago

Hello Rohit, thank you for your reply. I have uploaded a screenshot of the CSV file. For some reason the website is displaying the same picture twice.

POSTED BY: Young Il Baik
Posted 3 years ago

Try to make it easy for people trying to help you. Remember that they have jobs and are doing this for free in their spare time. Instead of expecting people to manually type in a long URL from an image, why not provide a small list of URLs that can be copied and pasted?

You asked a very similar question here and I provided an answer using Map. Did you try that?

POSTED BY: Rohit Namjoshi
Posted 3 years ago

Hi Rohit, I have now attached a sample CSV. I tried to use the Map function, but I was not able to obtain the results. So above, I am using another method to retrieve the data, and, taking your previous advice, I am partitioning my data-gathering process into three parts.

1) The URLs in the CSV file above take me to an index page. What I am trying to do above is go into each of the URLs, scrape the part that says "xxx.htm", and put it into a variable, because I would like to use the "xxx.htm" part to get to the actual webpage that I want to scrape.

2) The next step (not part of this question) is to have Mathematica go into each of the links and download the text of the .htm files obtained in step 1.

3) Then I will run a textual analysis on the saved texts from step 2.

Sorry for the confusion; I hope I'm making it clearer. I truly appreciate your time!

POSTED BY: Young Il Baik
Posted 3 years ago

Hi Rohit, I have also attached a currently working version of the notebook - Thank you!

POSTED BY: Young Il Baik
Posted 3 years ago

Hi Young,

Thanks for providing the data in a usable form.

In your example you extracted "proxy2010.htm". Is that the end result you want for each row in the CSV? Based on the rest of the question, it seems the answer is "No": you want to scrape "proxy2010.htm", but that is not possible because it is not a full URL; the domain and path are missing.

Maybe you want this

getProxyStatementURL[link_] :=
  Import[link, "Hyperlinks"] //
    Select[StringMatchQ[#, __ ~~ "Archives" ~~ __ ~~ ".htm"] &] //
    First

data = Import["~/Downloads/def14asample.csv", "Dataset", HeaderLines -> 1];

dataWithProxyStatementURL =
  data[All, <|#, "proxyStatementURL" -> getProxyStatementURL[#["path"]]|> &]

The result is a Dataset with a new column proxyStatementURL which has the full URL to the proxy statement, so rather than just "proxy2010.htm", it is "https://www.sec.gov/Archives/edgar/data/34782/000003478210000011/proxy2010.htm"

You can then Import the plaintext from those URLs and do whatever processing you need to do. If you need help with that part, please specify the details in your response.
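As an untested sketch of that last step, the same Dataset idiom can add the imported text as another column; the column name "plainText" is arbitrary:

```
(* Sketch: add a column holding the imported plaintext of each proxy
   statement, building on dataWithProxyStatementURL from above. *)
dataWithText =
  dataWithProxyStatementURL[All,
    <|#, "plainText" -> Import[#["proxyStatementURL"], "Plaintext"]|> &]
```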

POSTED BY: Rohit Namjoshi
Posted 3 years ago

First of all, thank you for your reply; I truly appreciate your help.

  1. I have run the code from your reply, but for some reason I am not able to recreate the new column "proxyStatementURL" with the full URL to the actual proxy statement. Could it be that I am not seeing the column because "data" is not defined in the code you wrote above? It still appears in blue after I run your code (by itself, not in conjunction with my previous code).

  2. As you mentioned, the URLs in the "path" column are not the end result I want; you are 100% accurate that the actual proxy statement is what I would like. I wanted Mathematica to look into each of the raw links and return the text of each, so I could extract just the part (such as xxx.htm) that I can use to get to the proxy statement. However, it seems you were already able to do that in your reply.

I have the code (in the previously attached file) to import the plaintext from those proxy statements and do further textual analysis. However, may I also ask how I could run this analysis for 100 rows (100 proxy statements)?

Once again, thank you for your time!

POSTED BY: Young Il Baik
Posted 3 years ago

Not sure why it did not work. Did you change the path "def14asample.csv" to the right one on your system? Anyway, I have attached a notebook with the working solution. To make it simpler, just make sure "def14asample.csv" is in the same folder as the notebook.

> However, may I also ask how I could do this analysis for 100 rows (100 proxy statements)?

The code in the notebook will process all of the rows from the imported CSV. You can see this in the attached notebook.

Attachments:
POSTED BY: Rohit Namjoshi
Posted 3 years ago

Hi Rohit, my apologies for the delay in getting back to you. As I am a second-year graduate student, I had to take care of some coursework urgencies (exams) these past few days.

First of all, I would like to sincerely thank you for your answer and the attached notebook. It worked perfectly and I cannot thank you enough!

From here on, I wanted to move on to the textual analysis, now that I have a compiled list of proxy statement addresses. I have already put together a notebook that conducts some textual analyses, and all I need to do now is have Mathematica loop through each of the hyperlinks in the proxyStatementURL column that you created.

Would you mind if I ask for your help on how to set up the loop?

I am attaching my textual analysis file here. The current version of the textual analysis works fine, but the problem is that I have to manually change the row number for each address. I hope my question is clear; please let me know if I can clarify anything!

Thank you truly once again for all your help in getting me through this obstacle!

POSTED BY: Young Il Baik
Posted 3 years ago

Hi Young,

Looks like there are issues with the code in "Notebook-9-3-Conservative Range COPY.nb": several cells failed to evaluate (I did not evaluate any cells; the errors are present in the notebook). I have a hard time following the code in the notebook.

Some expressions have much simpler implementations

StringReplace[#, (StartOfString ~~ Whitespace) | (Whitespace ~~ EndOfString) :> ""] & /@ data0;

is the same as the following, assuming you want to remove whitespace from the beginning and end of each string:

StringTrim@data1

Some parts of the code look incorrect

y2019 = Length[StringPosition[data2, "2019"]];
y2018 = Length[StringPosition[data2, "2018"]];
y2017 = Length[StringPosition[data2, "2017"]];
y2016 = Length[StringPosition[data2, "2016"]];
y2015 = Length[StringPosition[data2, "2015"]];
y2014 = Length[StringPosition[data2, "2014"]];
y2013 = Length[StringPosition[data2, "2013"]];
y2012 = Length[StringPosition[data2, "2012"]];
y2011 = Length[StringPosition[data2, "2011"]];
y2010 = Length[StringPosition[data2, "2010"]];

Since data2 is a list of strings, StringPosition is going to return a list for every string (an empty list when there is no match). The Length is therefore going to be identical in all cases, so this is not going to give the right fiscal year:

fiscalyear = 
  Max[y2019, y2018, y2017, y2016, y2015, y2014, y2013, y2012, y2011, y2010];

Rather than enumerating every year, why not search for a pattern such as

"20"~~DigitCharacter~~DigitCharacter

Also, it is not clear why Max is used to determine the fiscal year. StringPosition returns positions within each string, not positions in the entire document.
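To sketch the pattern-based idea (assuming data2 is the list of strings from your notebook), you could join the strings into one text, count every matching four-digit year, and take the most frequent one; Counts and TakeLargest are used here just for illustration:

```
(* Sketch: count each year matching "20" ~~ DigitCharacter ~~ DigitCharacter
   across the whole document and pick the most frequent one. *)
fullText = StringRiffle[data2, " "];
yearCounts =
  Counts[StringCases[fullText, "20" ~~ DigitCharacter ~~ DigitCharacter]];
fiscalyear = First@Keys@TakeLargest[yearCounts, 1]
```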

When the following is evaluated, only a and b are defined, so Max will not evaluate, and therefore the If will not evaluate.

truename = 
  If[a == Max[a, b, c, d, e, f, g, h, i, j, k], 
   "name of registrant as specified in its charter)", 
   If[b == Max[a, b, c, d, e, f, g, h, i, j, k], 
    "name of registrant as specified in its certificate)"]];

The rest of the code has even more complex logic that looks incorrect and there are no comments in the code.

Anyway, assuming you are able to get your code to work correctly, wrap it in a function that accepts the proxy url and returns the desired result. Here is an example that just returns an association of word and sentence counts.

processProxyURL[proxyURL_] := 
 Module[{plainText, allWords, nonStopWords, sentences},
  plainText = Import[proxyURL, "Plaintext"];
  allWords = TextWords@plainText;
  nonStopWords = DeleteStopwords@allWords;
  sentences = TextSentences@plainText;
  <|"Words" -> Length@allWords, 
   "Non Stopwords" -> Length@nonStopWords, 
   "Sentences" -> Length@sentences|>
  ]

To run it on every proxy URL in the dataset and add the result as a new column

dataWithProxyStatementURL[All, <|#, "Processed Data" -> processProxyURL[#["proxyStatementURL"]]|> &]
POSTED BY: Rohit Namjoshi
Posted 3 years ago

Thank you once again, Rohit, for your wonderful response. I am very close to finishing this notebook and I truly appreciate all your help in the process. I have updated the notebook to remove all the unnecessary code, and it should now look much cleaner.

I have one last question that I wanted to ask you. The plainText, allWords, and nonStopWords variables are tremendously helpful, and I can definitely see how I can utilize them down the road.

What I hope to do now is search for a few keywords in the allWords variable so that I can finally count the number of keywords included in the proxy statements.

Below, I have attached the notebook that now conducts the textual analysis but fails when I try to search for a few keywords. Would I need to use a function other than Position? I tried FindList, TextSearch, and StringPosition, but these don't seem to work.

Would you mind briefly taking a look to point me in the right direction?

Thank you so much, Rohit! P.S. The dataset def14samplenew0123.csv is just a sample file with only 4 links to proxy statements, to check whether the analysis works.

POSTED BY: Young Il Baik
Posted 3 years ago

Hi Young,

Glad to hear that you are making progress. There are a few problems with the code.

- allWords is a list of words, so a phrase like "based on discretion" will never match a single word. The match should be done against plainText, and the right function to use is StringCount, not Position.
- allWords contains newlines, so a phrase you are looking for can be split across lines and will not match exactly.
- The words/phrases in plainText may be in mixed case, but the words/phrases you are looking for are all lowercase.

It is much better to use allWords converted to lowercase and joined with spaces to form a single string of all the words:

keyWords = {"judgment-based", "based on discretion", "exercise judgment", "exercising judgment", 
   "exercise discretion", "exercising discretion", "subjectivity", "discretion", 
   "not based on any mathematical", "not quantifiable", "adjusted upward", "adjusted up", 
   "upward adjustment", "qualitative",  "non-financial", "nonfinancial"};

cleanPlainText = StringRiffle@ToLowerCase@allWords
keyWordCounts = AssociationMap[StringCount[cleanPlainText, #] &, keyWords]

(*
<|"judgment-based" -> 0, "based on discretion" -> 0, 
 "exercise judgment" -> 0, "exercising judgment" -> 0, 
 "exercise discretion" -> 0, "exercising discretion" -> 0, 
 "subjectivity" -> 0, "discretion" -> 2, 
 "not based on any mathematical" -> 0, "not quantifiable" -> 0, 
 "adjusted upward" -> 0, "adjusted up" -> 0, "upward adjustment" -> 0,
  "qualitative" -> 0, "non-financial" -> 0, "nonfinancial" -> 0|>
*)

Similarly for keywords2 (give it a better name) and trapWords. Add keyWordCounts and the counts for the other lists to the results association, i.e. "keywordsfrequency" -> keyWordCounts, etc.
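A sketch of that suggestion, factoring the counting into one helper (trapWords here stands in for your own list, and the result keys are placeholders):

```
(* Sketch: count any list of phrases the same way as keyWords above,
   then collect all counts into one results association. *)
countPhrases[phrases_List] :=
  AssociationMap[StringCount[cleanPlainText, #] &, phrases]

results = <|
   "keywordsfrequency" -> countPhrases[keyWords],
   "trapwordsfrequency" -> countPhrases[trapWords]|>
```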

You should manually check that the counts match by searching for the text in the browser to make sure there are no other edge cases to be considered.

POSTED BY: Updating Name