Message Boards Message Boards

Extracting Data From an Imported HTML Document

Posted 10 years ago

Hello everyone,

I'm trying to extract data on share ownership from a document called DEF14A, which is a form filed by companies with the Securities and Exchange Commission. I am able to import the documents in HTML from the SEC using Mathematica. If a document contains a share ownership table, the table is usually preceded by the (start) string "SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND MANAGEMENT" and ends with the tag that indicates the end of a table (the tag won't appear in this post if I type it here). I have been using StringPosition with the start string and the tag to try to pull out the table and some surrounding text but have had mixed results because it appears that start string may not be found if the person entering it inserted a carriage return somewhere within the string, and because I don't fully understand StringPosition's treatment of overlaps: I want simply the first occurrence of the start string followed by the tag with anything in between except the start string and the end tag.

I'd very much appreciate any tips on extracting the owners' names and the number of shares they hold. I've attached an HTML snippet of the DEF14A, although it is long and probably not of much interest to anyone.

Regards,

Gregory

Attachments:
POSTED BY: Gregory Lypny
7 Replies
Posted 10 years ago

My advice is to try with the Import[] command with a syntax like:

Import["http://example.com/abc.html", {"HTML", "Data"}]

You can further refine the imported data specifying subelements like:

Import["http://example.com/abc.html", {"HTML", "Data",2}]
Import["http://example.com/abc.html", {"HTML", "Data",2,1}]

and so on. (Instead on 2 and 1 use the numbers that make sense for your data)

Then use Cases[] to complete the task.

This is easier than working with the XML representation. Give it a try.

POSTED BY: Gustavo Delfino

Gregory: I believe you might be using the starter function I responded to a while back. Below is slight modification just change hard coded year to get most recent year, so this returns the most recent raw text of the DEF14A filing from a list range 1994 to current year for a given ticker

processSECDEF14A[ticker_] := 
  Module[{cik, paddedCIK, urlfullpath, searchResults, textOnlylinks, 
    top1linkfromList, formDEF14A}, cik = FinancialData[ticker, "CIK"];
   paddedCIK = IntegerString[ToExpression[cik], 10, 10];
   urlfullpath = 
    "http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> paddedCIK <>
      "+TYPE%3DDEF&first=1994&last=" <> 
     DateString[DateList[], "Year"] <> "";
   searchResults = Import[urlfullpath, "Hyperlinks"];
   textOnlylinks = 
    Select[searchResults, Function[StringMatchQ[#, "*.txt"] == True]];
   top1linkfromList = First[textOnlylinks];
   formDEF14A = Import[top1linkfromList, "Plaintext"];
   Return[formDEF14A];];

I tried this for HEWLETT PACKARD CO

hpDEF14A = processSECDEF14A["HPQ"];

The raw data is as previously stated, SGML,HTML, JPG, PDF combo document. In order find the beneficial ownerships (section16a) there will be some data mining necessary. The raw data HTML segments are created by many different edgar filing service providers (software). But all of them are convoluted HTML so I would suggest working with both the raw file and a cleaner version. The following simple statement will remove all HTML element attributes:

hpDEF14ANoAttribs = StringReplace[hpDEF14A, 
 RegularExpression["(<\\w+)[^>]*(>)"] -> "$1$2"];

You can perform some further cleaning such as removing font elements or any other cleaning necessary. The trick as you are trying to accomplish is to get the HTML Table that contains the beneficial ownership list. Footnote would have to be dealt with later. But using:

ImportString["Table Fragment Here", {"HTML", "Data"}];

I'm still thinking about a way to grab the required table. These HTML files are so complicated that I don't think "FullData" would return a clean data set. StringPosition may come in handy but figuring out the proper algorithm is more critical to find a solution. Something like find html table that contains the word "beneficial", etc

POSTED BY: Hans Michel

Gregory: Try the following function

getBeneficialfromSECDEF14A[cik_] := 
  Module[{paddedCIK, urlfullpath, searchResults, textOnlylinks, 
    top1linkfromList, formDEF14A, htmltagstartpos, htmltagendpos, 
    htmlDEF14A, DEF14ANoAttribs, tablestartpos, tablestartnearfunc, 
    tableendpos, tableendnearfunc, benpos, tablestartnearest, 
    tableendnearest, 
    tablestarttally, tablestartcommon, tableendtally, tableendcommon, 
    bentablestart, bentableend, bentable},
   paddedCIK = IntegerString[ToExpression[cik], 10, 10];
   urlfullpath = 
    "http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> paddedCIK <>
      "+TYPE%3DDEF&first=1994&last=" <> 
     DateString[DateList[], "Year"] <> "";
   searchResults = Import[urlfullpath, "Hyperlinks"];
   textOnlylinks = 
    Select[searchResults, Function[StringMatchQ[#, "*.txt"] == True]];
   top1linkfromList = First[textOnlylinks];
   formDEF14A = Import[top1linkfromList, "Plaintext"];

   htmltagstartpos = 
    StringPosition[formDEF14A, "<html>", IgnoreCase -> True];
   htmltagendpos = 
    StringPosition[formDEF14A, "</html>", IgnoreCase -> True];
   htmlDEF14A = 
    StringTake[
     formDEF14A, {First[Flatten[htmltagstartpos]], 
      Last[Flatten[htmltagendpos]]}];

   DEF14ANoAttribs = 
    StringReplace[htmlDEF14A, 
     RegularExpression["(<\\w+)[^>]*(>)"] -> "$1$2"];
    DEF14ANoAttribs = 
    StringReplace[
     DEF14ANoAttribs, {"<br>" -> " ", "<hr>" -> " ", "&nbsp;" -> " ", 
      "<u>" -> "", "</u>" -> "", "<b>" -> "", "</b>" -> "", 
      "</B>" -> "", "<font>" -> " ", "</font>" -> " ", 
      "<FONT>" -> " ", "</FONT>" -> " ", "<small>" -> "", 
      "</small>" -> "", "> " -> ">", " <" -> "<", "  " -> " "}, 
     IgnoreCase -> True];
    DEF14ANoAttribs = 
    StringReplace[DEF14ANoAttribs, RegularExpression["\\n\\n"] -> ""];
    DEF14ANoAttribs = ReplaceRepeated[DEF14ANoAttribs, {"  " -> " "}];
   tablestartpos = 
    StringPosition[DEF14ANoAttribs, "<table>", IgnoreCase -> True];
   tablestartnearfunc = Nearest[tablestartpos];

   tableendpos = 
    StringPosition[DEF14ANoAttribs, "</table>", IgnoreCase -> True];
   tableendnearfunc = Nearest[tableendpos];

   benpos = 
    StringPosition[DEF14ANoAttribs, "beneficial", IgnoreCase -> True];
   (*bnearfunc=Nearest[bpos];*)

   tablestartnearest = Flatten[Map[tablestartnearfunc, benpos], 1];
   tableendnearest = Flatten[Map[tableendnearfunc, benpos], 1];
   tablestarttally = Tally[tablestartnearest];
   tablestartcommon = Commonest[tablestartnearest];
   tableendtally = Tally[tableendnearest];
   tableendcommon = Commonest[tableendnearest];
   bentablestart = Min[tablestartcommon];
   bentableend = Min[tableendcommon];
   If[bentableend < bentablestart, 
    (* find other table end*)
    bentableend = 
      SelectFirst[tableendnearest[[All, 2]], bentablestart < # &];
    ];
   bentable = 
    ImportString[
     StringTake[
      DEF14ANoAttribs, {bentablestart, bentableend}], {"HTML", 
      "Data"}];
   Return[bentable];];

With a test of HPQ (Hewlett Packard) cik of 47217

getBeneficialfromSECDEF14A[47217]
{{{"Name of Beneficial Owner", 
   "Shares of Common Stock Beneficially Owned", 
   "Percent of Common Stock Outstanding"}, {"Dodge & Cox (1)", 
   "171,145,618", 9., "%"}, {"State Street Corporation (2)", 
   "97,792,253", 5.1, "%"}, {"Marc L. Andreessen (3)", "40,740", 
   "*"}, {"Shumeet Banerji", "32,694", "*"}, {"Robert R. Bennett", 
   "4,262", "*"}, {"Rajiv L. Gupta (4)", "71,271", 
   "*"}, {"Klaus Kleinfeld", "\[LongDash]", 
   "*"}, {"Raymond J. Lane (5)", "462,618", 
   "*"}, {"Ann M. Livermore (6)", "318,742", 
   "*"}, {"Raymond E. Ozzie", "4,262", "*"}, {"Gary M. Reiner (7)", 
   "82,535", "*"}, {"Patricia F. Russo (8)", "20,888", 
   "*"}, {"James A. Skinner", "4,262", 
   "*"}, {"Margaret C. Whitman (9)", "4,419,346", 
   "*"}, {"Catherine A. Lesjak (10)", "875,905", 
   "*"}, {"William L. Veghte (11)", "385,953", 
   "*"}, {"Dion J. Weisler (12)", "12,500", 
   "*"}, {"Michael G. Nefkens (13)", "461,979", 
   "*"}, {"All current executive officers and directors as a group \
(24 persons) (14)", "7,838,018", "*"}}, {"*", 
  "Represents holdings of less than 1%."}}

I also tried this with Microsoft and the function seems to have extracted the Beneficial Owner table

getBeneficialfromSECDEF14A[789019]

Give this a trial run it may be slow you may wish to remove the attribute and other html cleaning. I was also taking advantage of the built-in Nearest function. Maybe this function could be optimized by using the correct options.

(Addition) Here is a different version without the cleaning of the HTML leaving that to built-in commands:

getBeneficialfromSECDEF14A[cik_] := 
  Module[{formDEF14A, tablestartpos, tablestartnearfunc, tableendpos, 
    tableendnearfunc, benpos, tablestartnearest, tableendnearest, 
    tablestartcommon, tableendcommon, bentablestart, bentableend, 
    bentable} ,
   formDEF14A = 
    Import[SelectFirst[
      Import["http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> 
        IntegerString[ToExpression[cik], 10, 10] <> 
        "+TYPE%3DDEF&first=1994&last=" <> 
        DateString[DateList[], "Year"], "Hyperlinks"], 
      Function[StringMatchQ[#, "*.txt"] == True]], "Plaintext"];
   tablestartpos = 
    StringPosition[formDEF14A, "<table", IgnoreCase -> True];
   tablestartnearfunc = Nearest[tablestartpos];
   tableendpos = 
    StringPosition[formDEF14A, "</table>", IgnoreCase -> True];
   tableendnearfunc = Nearest[tableendpos];
   benpos = 
    StringPosition[formDEF14A, "beneficial", IgnoreCase -> True];
   tablestartnearest = Flatten[Map[tablestartnearfunc, benpos], 1];
   tableendnearest = Flatten[Map[tableendnearfunc, benpos], 1];
   tablestartcommon = Commonest[tablestartnearest];
   tableendcommon = Commonest[tableendnearest];
   bentablestart = Min[tablestartcommon];
   bentableend = Min[tableendcommon];
   If[bentableend < bentablestart,(*find other table end*)
    bentableend = 
      SelectFirst[tableendnearest[[All, 2]], 
       Function[Less[bentablestart, #]]];];
   bentable = 
    ImportString[
     StringTake[formDEF14A, {bentablestart, bentableend}], {"HTML", 
      "Data"}];
   Return[bentable];];

Hans

POSTED BY: Hans Michel

I'm not able to take a deep look at the file right now, but since (X)HTML is a subset of XML, parsing your HTML as XML using Mathematica's XML processing capabilities might make handling the data easier.

Additionally, Mathematica can directly import XHTML, but I'm not sure how easy that is to work with.

POSTED BY: Jesse Friedman
Posted 10 years ago

Thanks Gustavo,

Sorry for my delay in responding. I'm going to give

Import[page,{"HTML","Data"}]

a try. Looks promising.

Gregory

POSTED BY: Gregory Lypny
Posted 10 years ago
POSTED BY: Gregory Lypny
Posted 10 years ago

Thanks Hans,

I'll study your code.

Gregory

POSTED BY: Gregory Lypny
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract