Hello everyone,
I'm trying to extract share ownership data from DEF14A filings, a form that companies file with the Securities and Exchange Commission. I can import the documents as HTML from the SEC using Mathematica. If a document contains a share ownership table, the table is usually preceded by the (start) string "SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND MANAGEMENT" and ends with the tag that indicates the end of a table (the tag won't appear in this post if I type it here). I have been using StringPosition with the start string and the tag to pull out the table and some surrounding text, but with mixed results, for two reasons. First, the start string may not be found if the person entering it inserted a carriage return somewhere within the string. Second, I don't fully understand how StringPosition treats overlaps. What I want is simply the first occurrence of the start string followed by the end tag, with anything in between except another start string or end tag.
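To make the goal concrete, here is a minimal sketch of the match I am after. It is written as a regular expression in Python rather than Mathematica, purely for illustration. The names build_section_pattern and extract_ownership_section are made up for this post, and the "</table>" end tag is an assumption about what the filings contain. The pattern allows any run of whitespace (including carriage returns) between the words of the start string, ignores case, and refuses to cross a second start string or an end tag. I believe the same expression could be passed to Mathematica's RegularExpression, but I have not verified that.

```python
import re

START_PHRASE = "SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND MANAGEMENT"
END_TAG = "</table>"  # assumed end-of-table tag; adjust if the filings differ


def build_section_pattern(start_phrase=START_PHRASE, end_tag=END_TAG):
    """Compile a pattern for: start phrase, then text, then the first end tag."""
    # Join the words with \s+ so spaces, tabs, carriage returns and newlines
    # between words are all accepted.
    start = r"\s+".join(re.escape(word) for word in start_phrase.split())
    end = re.escape(end_tag)
    # Consume one character at a time, but never step onto another
    # start phrase or an end tag (so the match cannot overlap either).
    body = rf"(?:(?!{start}|{end}).)*"
    return re.compile(rf"{start}{body}{end}", re.IGNORECASE | re.DOTALL)


def extract_ownership_section(html):
    """Return the matched section as a string, or None if nothing matches."""
    match = build_section_pattern().search(html)
    return match.group(0) if match else None
```

One consequence of this design: if the start string appears twice before the end tag (for example once in a table of contents and once as the section heading), the match begins at the later occurrence. The sketch does not handle markup or entities such as &nbsp; embedded inside the start string.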
I'd very much appreciate any tips on extracting the owners' names and the number of shares they hold. I've attached an HTML snippet of the DEF14A, although it is long and probably not of much interest to anyone.
Regards,
Gregory