Storing output into a variable + looping function

Posted 5 years ago

Hello, I have a CSV (or Excel) file that looks like the one below:

[Image: CSV file with columns, including "path"]

I would like to first import the first cell containing the hyperlink (shown in blue), extract everything in plaintext format (just the content), and put the output into the column next to the "path" column.

I have the following code so far (the first line imports the column with the hyperlinks):

data1 = dataset[1, All, 6];   (* the column holding the hyperlinks *)

data2 = data1[3];             (* the third row *)

data3 = StringSplit[Import[data2, "Plaintext"], ","];

(* strip leading/trailing commas and whitespace from every piece *)
data4 = StringReplace[#, (StartOfString ~~ ",") | ("," ~~ EndOfString) :> ""] & /@ data3;
data5 = StringReplace[#, (StartOfString ~~ Whitespace) | (Whitespace ~~ EndOfString) :> ""] & /@ data4;

(* split everything into individual tokens *)
data6 = StringSplit[ToString[data5], " "];
data7 = StringSplit[data6, ".htm "];
I have the output:

{{"Description"}, {"Document"}, {"Type"}, {"Size
  "}, {"1"}, {"PROXY"}, {"2010"}, {"proxy2010.htm"}, {"DEF"}, \
{"14A"}, {"717341
  "}}

I would like to take the part that says "xxx.htm", where xxx changes for every row but the name always ends in ".htm".

My question is:

**From the output above, can I take the ".htm" part of the output and store that .htm address in a variable? And can I run this entire process for rows 2 through 100 (a loop)?**

Thank you,
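To make what I am after more concrete, here is a rough, untested sketch of the kind of thing I am hoping for (the names htmName and htmNames are just placeholders; it assumes the output list above is in data7 and that each page's plain text contains the file name):

(* keep the one entry from the output that ends in ".htm" *)
htmName = First @ Select[Flatten[data7], StringEndsQ[#, ".htm"] &]

(* repeat the import-and-extract step for rows 2 through 100 *)
htmNames = Table[
  First @ StringCases[Import[data1[row], "Plaintext"], WordCharacter .. ~~ ".htm"],
  {row, 2, 100}]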

POSTED BY: Young Il Baik
13 Replies

Welcome to Wolfram Community!
Please make sure you know the rules: https://wolfr.am/READ-1ST
Images don't help other members answer your question; they need to be able to copy the elements involved, such as code and data.
If you don't want to share your data file, you can create a sample file that contains part of your data and attach it, otherwise we won't be able to help you.

Also, next time please mention that a post is a continuation of an earlier question that has already received some answers.

POSTED BY: EDITORIAL BOARD
Posted 5 years ago

Please attach the CSV file (or a few rows from it) to your post.

POSTED BY: Rohit Namjoshi
Posted 5 years ago

Hello Rohit, thank you for your reply. I have uploaded a screenshot of the CSV file. For some reason the website is showing the same picture twice.

POSTED BY: Young Il Baik
Posted 5 years ago

Try to make it easy for people trying to help you. Remember that they have jobs and are doing this for free in their spare time. Instead of expecting people to manually type in a long URL from an image, why not provide a small list of URLs that can be copied and pasted?

You asked a very similar question here and I provided an answer using Map. Did you try that?

POSTED BY: Rohit Namjoshi
Posted 5 years ago

Hi Rohit, I have now attached a sample CSV. I tried to use the Map function, but I was not able to obtain the results. So above I am using another method to retrieve the data, and, taking your previous advice, I am splitting my data-gathering process into three parts.

1) The URLs in the CSV file take me to an index page. What I am trying to do above is go into each of those URLs, scrape the part that says "xxx.htm", and put it into a variable, because I want to use the "xxx.htm" part to get to the actual webpage that I would like to scrape.

2) The next step (not part of this question) is to have Mathematica go into each of those links and download the text of the .htm pages found in step 1.

3) Then I will run a textual analysis on the texts saved in step 2.

Sorry for the confusion; I hope this makes it clearer. I truly appreciate your time!

POSTED BY: Young Il Baik
Posted 5 years ago

Hi Rohit, I have also attached a currently working version of the notebook - Thank you!

POSTED BY: Young Il Baik
Posted 5 years ago

Hi Young,

Thanks for providing the data in a usable form.

In your example you extracted "proxy2010.htm". Is that the end result you want for each row in the CSV? Based on the rest of the question, it seems the answer is "No": you want to scrape "proxy2010.htm", but that is not possible because it is not a full URL; the domain and path are missing.

Maybe you want this:

getProxyStatementURL[link_] := Import[link, "Hyperlinks"] // 
   Select[StringMatchQ[#, __ ~~ "Archives" ~~ __ ~~ ".htm"] &] // 
   First

Import["~/Downloads/def14asample.csv", "Dataset", HeaderLines -> 1]

dataWithProxyStatementURL = 
 data[All, <|#, "proxyStatementURL" -> getProxyStatementURL[#["path"]]|> &]

The result is a Dataset with a new column, proxyStatementURL, which has the full URL of the proxy statement, so rather than just "proxy2010.htm" it is "https://www.sec.gov/Archives/edgar/data/34782/000003478210000011/proxy2010.htm".

You can then Import the plaintext from those URLs and do whatever processing you need to do. If you need help with that part, please specify the details in your response.
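For example, adding the imported text as another column could look roughly like this (just a sketch, not tested against your data; the column name proxyStatementText is only illustrative):

(* sketch: add a column with the plain text of each proxy statement *)
dataWithText =
 dataWithProxyStatementURL[All,
  <|#, "proxyStatementText" -> Import[#["proxyStatementURL"], "Plaintext"]|> &]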

POSTED BY: Rohit Namjoshi
Posted 5 years ago

First of all, thank you for your reply; I truly appreciate your help.

  1. I have run the code from your reply, but for some reason I am not able to recreate the new "proxyStatementURL" column with the full URL to the actual proxy statement. Could it be that I am not seeing the column because I did not specify what "data" is in the code you wrote above? It still appears in blue after I run your code (by itself, not in conjunction with my previous code).

  2. As you mentioned, the URLs under the "path" column are not the end result I want, but you are 100% right that the actual proxy statement is what I am after. Therefore, I wanted Mathematica to look into each of the raw links and return their text so that I could take out just the part I need (such as xxx.htm) and use it to get to the proxy statement. However, based on your reply, it seems you were already able to do that.

I have the code (in the previously attached file) that I can use to import the plaintext from those proxy statements and do further textual analysis. However, may I also ask how I could do this analysis for 100 rows (100 proxy statements)?

Once again, thank you for your time!

POSTED BY: Young Il Baik
Posted 5 years ago

Not sure why it did not work. Did you change the path to "def14asample.csv" to the right one on your system? Anyway, I have attached a notebook with the working solution. To make it simpler, just make sure "def14asample.csv" is in the same folder as the notebook.

However, may I also ask how I could do this analysis for 100 rows (100 proxy statements)?

The code in the notebook will process all of the rows from the imported CSV. You can see this in the attached notebook.
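If you ever want to restrict it to, say, just the first 100 rows, you can slice the imported Dataset before adding the column, along these lines (a sketch using the same getProxyStatementURL as above):

(* take only the first 100 rows of the imported CSV, then add the URL column *)
first100 = Import["def14asample.csv", "Dataset", HeaderLines -> 1][1 ;; 100];
first100WithURL =
 first100[All, <|#, "proxyStatementURL" -> getProxyStatementURL[#["path"]]|> &]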

Attachments:
POSTED BY: Rohit Namjoshi
Posted 5 years ago
POSTED BY: Young Il Baik
Posted 5 years ago
POSTED BY: Rohit Namjoshi
Posted 5 years ago
POSTED BY: Young Il Baik
Posted 5 years ago
POSTED BY: Updating Name