
Extracting Data From an Imported HTML Document

Posted 10 years ago

Hello everyone,

I'm trying to extract data on share ownership from a document called DEF14A, a form that companies file with the Securities and Exchange Commission. I am able to import the documents in HTML from the SEC using Mathematica. If a document contains a share ownership table, the table is usually preceded by the (start) string "SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND MANAGEMENT" and ends with the tag that indicates the end of a table (the tag won't appear in this post if I type it here). I have been using StringPosition with the start string and that tag to pull out the table and some surrounding text, but I have had mixed results for two reasons: the start string may not be found if the person who entered it inserted a carriage return somewhere within it, and I don't fully understand StringPosition's treatment of overlaps. What I want is simply the first occurrence of the start string followed by the end tag, with anything in between except another start string or end tag.
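
A whitespace-tolerant string pattern might sidestep both problems: matching Whitespace between the words absorbs stray line breaks inside the heading, and Shortest keeps the match from running past the first closing table tag. A minimal sketch, assuming the filing's text is already in a string doc:

    (* a minimal sketch: doc is assumed to hold the imported filing text *)
    startPat = "SECURITY" ~~ Whitespace ~~ "OWNERSHIP" ~~ Whitespace ~~ "OF" ~~ 
       Whitespace ~~ "CERTAIN" ~~ Whitespace ~~ "BENEFICIAL" ~~ Whitespace ~~ 
       "OWNERS" ~~ Whitespace ~~ "AND" ~~ Whitespace ~~ "MANAGEMENT";
    
    (* first ownership block, if any; returns a list with at most one string *)
    StringCases[doc, Shortest[startPat ~~ __ ~~ "</table>"], 1, 
     IgnoreCase -> True]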

I'd very much appreciate any tips on extracting the owners' names and the number of shares they hold. I've attached an HTML snippet of the DEF14A, although it is long and probably not of much interest to anyone.

Regards,

Gregory

Attachments:
POSTED BY: Gregory Lypny
7 Replies
Posted 10 years ago

Thanks Hans,

I'll study your code.

Gregory

POSTED BY: Gregory Lypny

Gregory: Try the following function:

getBeneficialfromSECDEF14A[cik_] := 
  Module[{paddedCIK, urlfullpath, searchResults, textOnlylinks, 
    top1linkfromList, formDEF14A, htmltagstartpos, htmltagendpos, 
    htmlDEF14A, DEF14ANoAttribs, tablestartpos, tablestartnearfunc, 
    tableendpos, tableendnearfunc, benpos, tablestartnearest, 
    tableendnearest, 
    tablestarttally, tablestartcommon, tableendtally, tableendcommon, 
    bentablestart, bentableend, bentable},
   paddedCIK = IntegerString[ToExpression[cik], 10, 10];
   urlfullpath = 
    "http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> paddedCIK <>
      "+TYPE%3DDEF&first=1994&last=" <> 
     DateString[DateList[], "Year"] <> "";
   searchResults = Import[urlfullpath, "Hyperlinks"];
   textOnlylinks = 
    Select[searchResults, StringMatchQ[#, "*.txt"] &];
   top1linkfromList = First[textOnlylinks];
   formDEF14A = Import[top1linkfromList, "Plaintext"];

   htmltagstartpos = 
    StringPosition[formDEF14A, "<html>", IgnoreCase -> True];
   htmltagendpos = 
    StringPosition[formDEF14A, "</html>", IgnoreCase -> True];
   htmlDEF14A = 
    StringTake[
     formDEF14A, {First[Flatten[htmltagstartpos]], 
      Last[Flatten[htmltagendpos]]}];

   DEF14ANoAttribs = 
    StringReplace[htmlDEF14A, 
     RegularExpression["(<\\w+)[^>]*(>)"] -> "$1$2"];
    DEF14ANoAttribs = 
    StringReplace[
     DEF14ANoAttribs, {"<br>" -> " ", "<hr>" -> " ", "&nbsp;" -> " ", 
      "<u>" -> "", "</u>" -> "", "<b>" -> "", "</b>" -> "", 
      "</B>" -> "", "<font>" -> " ", "</font>" -> " ", 
      "<FONT>" -> " ", "</FONT>" -> " ", "<small>" -> "", 
      "</small>" -> "", "> " -> ">", " <" -> "<", "  " -> " "}, 
     IgnoreCase -> True];
    DEF14ANoAttribs = 
    StringReplace[DEF14ANoAttribs, RegularExpression["\\n\\n"] -> ""];
    DEF14ANoAttribs = ReplaceRepeated[DEF14ANoAttribs, {"  " -> " "}];
   tablestartpos = 
    StringPosition[DEF14ANoAttribs, "<table>", IgnoreCase -> True];
   tablestartnearfunc = Nearest[tablestartpos];

   tableendpos = 
    StringPosition[DEF14ANoAttribs, "</table>", IgnoreCase -> True];
   tableendnearfunc = Nearest[tableendpos];

   benpos = 
    StringPosition[DEF14ANoAttribs, "beneficial", IgnoreCase -> True];
   (*bnearfunc=Nearest[bpos];*)

   tablestartnearest = Flatten[Map[tablestartnearfunc, benpos], 1];
   tableendnearest = Flatten[Map[tableendnearfunc, benpos], 1];
   tablestarttally = Tally[tablestartnearest];
   tablestartcommon = Commonest[tablestartnearest];
   tableendtally = Tally[tableendnearest];
   tableendcommon = Commonest[tableendnearest];
   bentablestart = Min[tablestartcommon];
   bentableend = Min[tableendcommon];
   If[bentableend < bentablestart, 
    (* find other table end*)
    bentableend = 
      SelectFirst[tableendnearest[[All, 2]], bentablestart < # &];
    ];
   bentable = 
    ImportString[
     StringTake[
      DEF14ANoAttribs, {bentablestart, bentableend}], {"HTML", 
      "Data"}];
   Return[bentable];];

Here is a test with HPQ (Hewlett-Packard), CIK 47217:

getBeneficialfromSECDEF14A[47217]
{{{"Name of Beneficial Owner", 
   "Shares of Common Stock Beneficially Owned", 
   "Percent of Common Stock Outstanding"}, {"Dodge & Cox (1)", 
   "171,145,618", 9., "%"}, {"State Street Corporation (2)", 
   "97,792,253", 5.1, "%"}, {"Marc L. Andreessen (3)", "40,740", 
   "*"}, {"Shumeet Banerji", "32,694", "*"}, {"Robert R. Bennett", 
   "4,262", "*"}, {"Rajiv L. Gupta (4)", "71,271", 
   "*"}, {"Klaus Kleinfeld", "\[LongDash]", 
   "*"}, {"Raymond J. Lane (5)", "462,618", 
   "*"}, {"Ann M. Livermore (6)", "318,742", 
   "*"}, {"Raymond E. Ozzie", "4,262", "*"}, {"Gary M. Reiner (7)", 
   "82,535", "*"}, {"Patricia F. Russo (8)", "20,888", 
   "*"}, {"James A. Skinner", "4,262", 
   "*"}, {"Margaret C. Whitman (9)", "4,419,346", 
   "*"}, {"Catherine A. Lesjak (10)", "875,905", 
   "*"}, {"William L. Veghte (11)", "385,953", 
   "*"}, {"Dion J. Weisler (12)", "12,500", 
   "*"}, {"Michael G. Nefkens (13)", "461,979", 
   "*"}, {"All current executive officers and directors as a group \
(24 persons) (14)", "7,838,018", "*"}}, {"*", 
  "Represents holdings of less than 1%."}}

I also tried this with Microsoft, and the function seems to have extracted the beneficial owner table as well:

getBeneficialfromSECDEF14A[789019]

Give this a trial run. It may be slow, so you may wish to remove the attribute stripping and other HTML cleaning. I was also taking advantage of the built-in Nearest function to match each occurrence of "beneficial" to the closest table tags; maybe the function could be optimized by using the right options.
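
To see what Nearest is doing here: StringPosition returns {start, end} pairs, so Nearest treats them as points and, for each position of the word "beneficial", returns the closest table-tag position. A toy illustration with made-up positions:

    (* toy illustration of the Nearest step, with made-up positions *)
    tableStarts = {{100, 106}, {5000, 5006}, {9000, 9006}};  (* pretend <table> positions *)
    nf = Nearest[tableStarts];
    nf[{5100, 5110}]
    (* -> {{5000, 5006}}: the <table> closest to a "beneficial" near position 5100 *)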

(Addition) Here is a different version without the manual cleaning of the HTML, leaving that to the built-in import commands:

getBeneficialfromSECDEF14A[cik_] := 
  Module[{formDEF14A, tablestartpos, tablestartnearfunc, tableendpos, 
    tableendnearfunc, benpos, tablestartnearest, tableendnearest, 
    tablestartcommon, tableendcommon, bentablestart, bentableend, 
    bentable} ,
   formDEF14A = 
    Import[SelectFirst[
      Import["http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> 
        IntegerString[ToExpression[cik], 10, 10] <> 
        "+TYPE%3DDEF&first=1994&last=" <> 
        DateString[DateList[], "Year"], "Hyperlinks"], 
      StringMatchQ[#, "*.txt"] &], "Plaintext"];
   tablestartpos = 
    StringPosition[formDEF14A, "<table", IgnoreCase -> True];
   tablestartnearfunc = Nearest[tablestartpos];
   tableendpos = 
    StringPosition[formDEF14A, "</table>", IgnoreCase -> True];
   tableendnearfunc = Nearest[tableendpos];
   benpos = 
    StringPosition[formDEF14A, "beneficial", IgnoreCase -> True];
   tablestartnearest = Flatten[Map[tablestartnearfunc, benpos], 1];
   tableendnearest = Flatten[Map[tableendnearfunc, benpos], 1];
   tablestartcommon = Commonest[tablestartnearest];
   tableendcommon = Commonest[tableendnearest];
   bentablestart = Min[tablestartcommon];
   bentableend = Min[tableendcommon];
   If[bentableend < bentablestart, (*find other table end*)
    bentableend = 
      SelectFirst[tableendnearest[[All, 2]], bentablestart < # &];];
   bentable = 
    ImportString[
     StringTake[formDEF14A, {bentablestart, bentableend}], {"HTML", 
      "Data"}];
   Return[bentable];];

Hans

POSTED BY: Hans Michel
Posted 10 years ago

My advice is to try the Import[] command with a syntax like:

Import["http://example.com/abc.html", {"HTML", "Data"}]

You can further refine the imported data specifying subelements like:

Import["http://example.com/abc.html", {"HTML", "Data",2}]
Import["http://example.com/abc.html", {"HTML", "Data",2,1}]

and so on. (Instead of 2 and 1, use the numbers that make sense for your data.)

Then use Cases[] to complete the task.

This is easier than working with the XML representation. Give it a try.
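
A rough sketch of that recipe, with a placeholder URL and a guessed test for spotting the ownership table (both are assumptions, not tested against a real filing):

    (* a rough sketch: the URL is a placeholder, and the "Beneficial" test is
       only a guess at how to recognize the ownership table; the level spec
       {1} may need adjusting for a real filing *)
    url = "http://example.com/def14a.htm";
    tables = Import[url, {"HTML", "Data"}];
    ownership = Cases[tables, 
       t_List /; ! FreeQ[t, 
          s_String /; StringContainsQ[s, "Beneficial", IgnoreCase -> True]], {1}];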

POSTED BY: Gustavo Delfino
Posted 10 years ago

Thanks Gustavo,

Sorry for my delay in responding. I'm going to give

Import[page,{"HTML","Data"}]

a try. Looks promising.

Gregory

POSTED BY: Gregory Lypny

I'm not able to take a deep look at the file right now, but since XHTML is a flavor of XML, parsing your HTML as XML using Mathematica's XML processing capabilities might make handling the data easier.

Additionally, Mathematica can directly import XHTML, but I'm not sure how easy that is to work with.
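
A minimal sketch of the XML route, assuming a placeholder URL: importing as "XMLObject" gives a symbolic XML tree that Cases can search for table elements.

    (* a minimal sketch: the URL is a placeholder; import the page as a
       symbolic XML tree and pull out every table element for inspection *)
    url = "http://example.com/def14a.htm";
    xml = Import[url, "XMLObject"];
    tables = Cases[xml, XMLElement["table", _, _], Infinity];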

POSTED BY: Jesse Friedman