Message Boards Message Boards

Extracting Data From an Imported HTML Document

Posted 9 years ago

Hello everyone,

I'm trying to extract data on share ownership from a document called DEF14A, which is a form filed by companies with the Securities and Exchange Commission. I am able to import the documents in HTML from the SEC using Mathematica. If a document contains a share ownership table, the table is usually preceded by the (start) string "SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND MANAGEMENT" and ends with the tag that indicates the end of a table (the tag won't appear in this post if I type it here). I have been using StringPosition with the start string and the tag to try to pull out the table and some surrounding text but have had mixed results because it appears that start string may not be found if the person entering it inserted a carriage return somewhere within the string, and because I don't fully understand StringPosition's treatment of overlaps: I want simply the first occurrence of the start string followed by the tag with anything in between except the start string and the end tag.

I'd very much appreciate any tips on extracting the owners' names and the number of shares they hold. I've attached an HTML snippet of the DEF14A, although it is long and probably not of much interest to anyone.

Regards,

Gregory

Attachments:
POSTED BY: Gregory Lypny
7 Replies
Posted 9 years ago

Thanks Hans,

I'll study your code.

Gregory

POSTED BY: Gregory Lypny

Gregory: Try the following function

getBeneficialfromSECDEF14A[cik_] := 
  Module[{paddedCIK, urlfullpath, searchResults, textOnlylinks, 
    top1linkfromList, formDEF14A, htmltagstartpos, htmltagendpos, 
    htmlDEF14A, DEF14ANoAttribs, tablestartpos, tablestartnearfunc, 
    tableendpos, tableendnearfunc, benpos, tablestartnearest, 
    tableendnearest, 
    tablestarttally, tablestartcommon, tableendtally, tableendcommon, 
    bentablestart, bentableend, bentable},
   paddedCIK = IntegerString[ToExpression[cik], 10, 10];
   urlfullpath = 
    "http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> paddedCIK <>
      "+TYPE%3DDEF&first=1994&last=" <> 
     DateString[DateList[], "Year"] <> "";
   searchResults = Import[urlfullpath, "Hyperlinks"];
   textOnlylinks = 
    Select[searchResults, Function[StringMatchQ[#, "*.txt"] == True]];
   top1linkfromList = First[textOnlylinks];
   formDEF14A = Import[top1linkfromList, "Plaintext"];

   htmltagstartpos = 
    StringPosition[formDEF14A, "<html>", IgnoreCase -> True];
   htmltagendpos = 
    StringPosition[formDEF14A, "</html>", IgnoreCase -> True];
   htmlDEF14A = 
    StringTake[
     formDEF14A, {First[Flatten[htmltagstartpos]], 
      Last[Flatten[htmltagendpos]]}];

   DEF14ANoAttribs = 
    StringReplace[htmlDEF14A, 
     RegularExpression["(<\\w+)[^>]*(>)"] -> "$1$2"];
    DEF14ANoAttribs = 
    StringReplace[
     DEF14ANoAttribs, {"<br>" -> " ", "<hr>" -> " ", "&nbsp;" -> " ", 
      "<u>" -> "", "</u>" -> "", "<b>" -> "", "</b>" -> "", 
      "</B>" -> "", "<font>" -> " ", "</font>" -> " ", 
      "<FONT>" -> " ", "</FONT>" -> " ", "<small>" -> "", 
      "</small>" -> "", "> " -> ">", " <" -> "<", "  " -> " "}, 
     IgnoreCase -> True];
    DEF14ANoAttribs = 
    StringReplace[DEF14ANoAttribs, RegularExpression["\\n\\n"] -> ""];
    DEF14ANoAttribs = ReplaceRepeated[DEF14ANoAttribs, {"  " -> " "}];
   tablestartpos = 
    StringPosition[DEF14ANoAttribs, "<table>", IgnoreCase -> True];
   tablestartnearfunc = Nearest[tablestartpos];

   tableendpos = 
    StringPosition[DEF14ANoAttribs, "</table>", IgnoreCase -> True];
   tableendnearfunc = Nearest[tableendpos];

   benpos = 
    StringPosition[DEF14ANoAttribs, "beneficial", IgnoreCase -> True];
   (*bnearfunc=Nearest[bpos];*)

   tablestartnearest = Flatten[Map[tablestartnearfunc, benpos], 1];
   tableendnearest = Flatten[Map[tableendnearfunc, benpos], 1];
   tablestarttally = Tally[tablestartnearest];
   tablestartcommon = Commonest[tablestartnearest];
   tableendtally = Tally[tableendnearest];
   tableendcommon = Commonest[tableendnearest];
   bentablestart = Min[tablestartcommon];
   bentableend = Min[tableendcommon];
   If[bentableend < bentablestart, 
    (* find other table end*)
    bentableend = 
      SelectFirst[tableendnearest[[All, 2]], bentablestart < # &];
    ];
   bentable = 
    ImportString[
     StringTake[
      DEF14ANoAttribs, {bentablestart, bentableend}], {"HTML", 
      "Data"}];
   Return[bentable];];

With a test of HPQ (Hewlett Packard) cik of 47217

getBeneficialfromSECDEF14A[47217]
{{{"Name of Beneficial Owner", 
   "Shares of Common Stock Beneficially Owned", 
   "Percent of Common Stock Outstanding"}, {"Dodge & Cox (1)", 
   "171,145,618", 9., "%"}, {"State Street Corporation (2)", 
   "97,792,253", 5.1, "%"}, {"Marc L. Andreessen (3)", "40,740", 
   "*"}, {"Shumeet Banerji", "32,694", "*"}, {"Robert R. Bennett", 
   "4,262", "*"}, {"Rajiv L. Gupta (4)", "71,271", 
   "*"}, {"Klaus Kleinfeld", "\[LongDash]", 
   "*"}, {"Raymond J. Lane (5)", "462,618", 
   "*"}, {"Ann M. Livermore (6)", "318,742", 
   "*"}, {"Raymond E. Ozzie", "4,262", "*"}, {"Gary M. Reiner (7)", 
   "82,535", "*"}, {"Patricia F. Russo (8)", "20,888", 
   "*"}, {"James A. Skinner", "4,262", 
   "*"}, {"Margaret C. Whitman (9)", "4,419,346", 
   "*"}, {"Catherine A. Lesjak (10)", "875,905", 
   "*"}, {"William L. Veghte (11)", "385,953", 
   "*"}, {"Dion J. Weisler (12)", "12,500", 
   "*"}, {"Michael G. Nefkens (13)", "461,979", 
   "*"}, {"All current executive officers and directors as a group \
(24 persons) (14)", "7,838,018", "*"}}, {"*", 
  "Represents holdings of less than 1%."}}

I also tried this with Microsoft and the function seems to have extracted the Beneficial Owner table

getBeneficialfromSECDEF14A[789019]

Give this a trial run it may be slow you may wish to remove the attribute and other html cleaning. I was also taking advantage of the built-in Nearest function. Maybe this function could be optimized by using the correct options.

(Addition) Here is a different version without the cleaning of the HTML leaving that to built-in commands:

getBeneficialfromSECDEF14A[cik_] := 
  Module[{formDEF14A, tablestartpos, tablestartnearfunc, tableendpos, 
    tableendnearfunc, benpos, tablestartnearest, tableendnearest, 
    tablestartcommon, tableendcommon, bentablestart, bentableend, 
    bentable} ,
   formDEF14A = 
    Import[SelectFirst[
      Import["http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> 
        IntegerString[ToExpression[cik], 10, 10] <> 
        "+TYPE%3DDEF&first=1994&last=" <> 
        DateString[DateList[], "Year"], "Hyperlinks"], 
      Function[StringMatchQ[#, "*.txt"] == True]], "Plaintext"];
   tablestartpos = 
    StringPosition[formDEF14A, "<table", IgnoreCase -> True];
   tablestartnearfunc = Nearest[tablestartpos];
   tableendpos = 
    StringPosition[formDEF14A, "</table>", IgnoreCase -> True];
   tableendnearfunc = Nearest[tableendpos];
   benpos = 
    StringPosition[formDEF14A, "beneficial", IgnoreCase -> True];
   tablestartnearest = Flatten[Map[tablestartnearfunc, benpos], 1];
   tableendnearest = Flatten[Map[tableendnearfunc, benpos], 1];
   tablestartcommon = Commonest[tablestartnearest];
   tableendcommon = Commonest[tableendnearest];
   bentablestart = Min[tablestartcommon];
   bentableend = Min[tableendcommon];
   If[bentableend < bentablestart,(*find other table end*)
    bentableend = 
      SelectFirst[tableendnearest[[All, 2]], 
       Function[Less[bentablestart, #]]];];
   bentable = 
    ImportString[
     StringTake[formDEF14A, {bentablestart, bentableend}], {"HTML", 
      "Data"}];
   Return[bentable];];

Hans

POSTED BY: Hans Michel

Gregory: I believe you might be using the starter function I responded to a while back. Below is slight modification just change hard coded year to get most recent year, so this returns the most recent raw text of the DEF14A filing from a list range 1994 to current year for a given ticker

processSECDEF14A[ticker_] := 
  Module[{cik, paddedCIK, urlfullpath, searchResults, textOnlylinks, 
    top1linkfromList, formDEF14A}, cik = FinancialData[ticker, "CIK"];
   paddedCIK = IntegerString[ToExpression[cik], 10, 10];
   urlfullpath = 
    "http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> paddedCIK <>
      "+TYPE%3DDEF&first=1994&last=" <> 
     DateString[DateList[], "Year"] <> "";
   searchResults = Import[urlfullpath, "Hyperlinks"];
   textOnlylinks = 
    Select[searchResults, Function[StringMatchQ[#, "*.txt"] == True]];
   top1linkfromList = First[textOnlylinks];
   formDEF14A = Import[top1linkfromList, "Plaintext"];
   Return[formDEF14A];];

I tried this for HEWLETT PACKARD CO

hpDEF14A = processSECDEF14A["HPQ"];

The raw data is as previously stated, SGML,HTML, JPG, PDF combo document. In order find the beneficial ownerships (section16a) there will be some data mining necessary. The raw data HTML segments are created by many different edgar filing service providers (software). But all of them are convoluted HTML so I would suggest working with both the raw file and a cleaner version. The following simple statement will remove all HTML element attributes:

hpDEF14ANoAttribs = StringReplace[hpDEF14A, 
 RegularExpression["(<\\w+)[^>]*(>)"] -> "$1$2"];

You can perform some further cleaning such as removing font elements or any other cleaning necessary. The trick as you are trying to accomplish is to get the HTML Table that contains the beneficial ownership list. Footnote would have to be dealt with later. But using:

ImportString["Table Fragment Here", {"HTML", "Data"}];

I'm still thinking about a way to grab the required table. These HTML files are so complicated that I don't think "FullData" would return a clean data set. StringPosition may come in handy but figuring out the proper algorithm is more critical to find a solution. Something like find html table that contains the word "beneficial", etc

POSTED BY: Hans Michel
Posted 9 years ago

Hi Hans, Nice to hear from you, and sorry for my delay in responding. I learned a lot from the code you gave me, and I modified it, not so much to do something different, but to match my own coding style. I agree with what you say about the challenges of extracting the data. The SEC is not always consistent in the way the data is entered, so there will be a lot of work in refining the code to grab compensation tables and ownership tables.

Below is my work-in-progress. Following the code you gave me, this function grabs all of the DEF14A text links by searching for CIK. It then returns a number of results. The ownership table results are still incomplete. If you play with it, run it without displaying the results because the output will be big.

I'm going to try your suggestion about using ImportString.

Regards,

Gregory

processSECDEF14A[cik_] := 
 Module[{paddedCIK, urlFullPath, searchResults, textLinks, numLinks, 
   formDEF14A, formDEF14AHTML, formDEF14ATables,

   companyInfoVarNames,
   companyInfoStartPos,
   companyInfoEndPos,
   companyInfoRaw,
   companyInfoArray,
   companyInfoForThisFiling,
   companyInfoVarPos,
   companyInfoVarData,

   resultsTable, textStartpos, theTables, ownershipTablePos},

  (*Search is by central index key*)

  paddedCIK = IntegerString[ToExpression[cik], 10, 10];
  urlFullPath = 
   "http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> paddedCIK <>
     "+TYPE%3DDEF&first=1994&last=2014";

  (*Grab all hyperlinks from search results page*)

  searchResults = Import[urlFullPath, "Hyperlinks"];
  textLinks = 
   DeleteDuplicates[
    Select[searchResults, Function[StringMatchQ[#, "*.txt"] == True]]];
  numLinks = Length[textLinks];

  (*EXTRACT DATA AND PROCESS*)

  (*Names of variables about the filing companies*)

  companyInfoVarNames = {"CENTRAL INDEX KEY", "FILED AS OF DATE", 
    "ACCESSION NUMBER", "CONFORMED SUBMISSION TYPE", 
    "PUBLIC DOCUMENT COUNT", "CONFORMED PERIOD OF REPORT", 
    "DATE AS OF CHANGE", "EFFECTIVENESS DATE", 
    "COMPANY CONFORMED NAME", "STANDARD INDUSTRIAL CLASSIFICATION", 
    "IRS NUMBER", "STATE OF INCORPORATION", "FISCAL YEAR END", 
    "FORM TYPE", "SEC ACT", "SEC FILE NUMBER", "DATE OF NAME CHANGE", 
    "FORMER CONFORMED NAME", "DATE OF NAME CHANGE", "STREET 1", 
    "STREET 2", "CITY", "STATE", "ZIP", "BUSINESS PHONE"};

  (* resultsTable contains tables of information extracted from all \
of the filings. 
  It is created by looping through all of the text links. *)

  resultsTable = Table[
    (*Grab the HTML for DEF14A*)
    (*formDEF14A=Import[theTextLink,
    "Plaintext"]; Not used currently *)

    formDEF14A = Import[theTextLink];(*Default import*)

    formDEF14ATables = 
     Import[theTextLink, {"HTML", 
       "Data"}];(*Import while attempting to grab tables*)

    (*---Company info---*)

    companyInfoStartPos = 
     Flatten[StringPosition[formDEF14A, "<SEC-HEADER>", 1, 
        Overlaps -> False]][[1]];
    companyInfoEndPos = 
     Flatten[StringPosition[formDEF14A, "</SEC-HEADER>", 1, 
        Overlaps -> False]][[1]];
    companyInfoRaw = 
     StringTake[
      formDEF14A, {companyInfoStartPos, companyInfoEndPos}];
    (*Split company info into an array with variable names and data*)

        companyInfoArray = 
     Select[StringTrim /@ 
       StringSplit[
        Select[ReadList[StringToStream[companyInfoRaw], String], 
         StringMatchQ[#, ___ ~~ ":" ~~ ___] && ! 
            StringMatchQ[#, "<" ~~ ___] &], ":"], 
      Length[#] == 2 && #[[2]] =!= "" &];
    (*companyInfoForThisFiling is a list or row for each filing*)

    companyInfoForThisFiling = Table[
      companyInfoVarPos = 
       Position[companyInfoArray, 
        thisCompanyInfoVar];(*Position of the variable in the array*)


      companyInfoVarData = 
       If[companyInfoVarPos =!= {}, 
        companyInfoArray[[companyInfoVarPos[[1, 1]], 
          companyInfoVarPos[[1, 2]] + 1]], 
        "NA"];(*Data extracted for that variable*)

      companyInfoVarData = Which[
        thisCompanyInfoVar =!= "STANDARD INDUSTRIAL CLASSIFICATION", 
        companyInfoVarData,
        thisCompanyInfoVar == "STANDARD INDUSTRIAL CLASSIFICATION" && 
         StringCases[companyInfoVarData, 
           "[" ~~ __ ~~ "]"] =!= {}, {StringTrim[
          StringReplace[companyInfoVarData, "[" ~~ __ ~~ "]" -> ""]], 
         StringCases[companyInfoVarData, 
          "[" ~~ sic__ ~~ "]" -> sic]},
        thisCompanyInfoVar == "STANDARD INDUSTRIAL CLASSIFICATION" && 
         StringCases[companyInfoVarData, 
           "[" ~~ __ ~~ "]"] == {}, {"NA", "NA"}
        ];
      companyInfoVarData,
      {thisCompanyInfoVar, 
       companyInfoVarNames}];(*End of processing the company info for \
this filing*)

    (*---Ownership info---*)
    (*Work in progress*)

    ownershipTablePos = 
     StringPosition[formDEF14A, 
      "SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND \
MANAGEMENT"];

    (*Each of the numLinks rows of resultsTable contains the \
following*)
    {
     formDEF14A,(*The default import of DEF14A in case it needs to be \
examined*)

     companyInfoArray,(*Raw array of company info for debugging; 
     can be removed later*)

     Flatten[companyInfoForThisFiling],(*A neat table of company \
information*)

     ownershipTablePos,(*A stab at the position of the ownership \
table if it exists [in progress]*)

     formDEF14ATables(*A stab a grab any data that is tabular*)
     },
    {theTextLink, textLinks}];(*End resultsTable*)

  (*RESULTS RETURNED BY THE FUNCTION*)
  {
   resultsTable[[All, 1]](*formDEF14A*),
   resultsTable[[All, 2]](*companyInfoArray for debugging*),
   resultsTable[[All, 3]](*Company info table*),
   resultsTable[[All, 4]](*Ownership table position*),
   resultsTable[[All, 5]](*form DEF14A imported as HTML data [
   tables]*),
   numLinks (*Number of text links that were found on the search \
results page*),
   textLinks (*Text links that that were found on the search results \
page*)
   }
  ]
POSTED BY: Gregory Lypny

My advice is to try with the Import[] command with a syntax like:

Import["http://example.com/abc.html", {"HTML", "Data"}]

You can further refine the imported data specifying subelements like:

Import["http://example.com/abc.html", {"HTML", "Data",2}]
Import["http://example.com/abc.html", {"HTML", "Data",2,1}]

and so on. (Instead on 2 and 1 use the numbers that make sense for your data)

Then use Cases[] to complete the task.

This is easier than working with the XML representation. Give it a try.

POSTED BY: Gustavo Delfino
Posted 9 years ago

Thanks Gustavo,

Sorry for my delay in responding. I'm going to give

Import[page,{"HTML","Data"}]

a try. Looks promising.

Gregory

POSTED BY: Gregory Lypny

I'm not able to take a deep look at the file right now, but since (X)HTML is a subset of XML, parsing your HTML as XML using Mathematica's XML processing capabilities might make handling the data easier.

Additionally, Mathematica can directly import XHTML, but I'm not sure how easy that is to work with.

POSTED BY: Jesse Friedman
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract