Hi Hans,
Nice to hear from you, and sorry for the delay in responding. I learned a lot from the code you gave me. I modified it, not so much to make it do something different as to match my own coding style. I agree with what you say about the challenges of extracting the data. The SEC filings are not always consistent in the way the data is entered, so it will take a lot of work to refine the code to grab the compensation tables and ownership tables.
Below is my work in progress. Following the code you gave me, this function grabs all of the DEF14A text links by searching on the CIK. It then returns a list of results: the raw filing text, the company information, the position of the ownership section, any tabular data, and the links themselves. The ownership table results are still incomplete. If you play with it, end the call with a semicolon so the results are not displayed, because the output will be big.
I'm going to try your suggestion about using ImportString.
Regards,
Gregory
processSECDEF14A[cik_] :=
Module[{paddedCIK, urlFullPath, searchResults, textLinks, numLinks,
formDEF14A, formDEF14AHTML, formDEF14ATables,
companyInfoVarNames,
companyInfoStartPos,
companyInfoEndPos,
companyInfoRaw,
companyInfoArray,
companyInfoForThisFiling,
companyInfoVarPos,
companyInfoVarData,
resultsTable, textStartpos, theTables, ownershipTablePos},
(*Search is by central index key*)
paddedCIK = IntegerString[ToExpression[cik], 10, 10];
urlFullPath =
"http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> paddedCIK <>
"+TYPE%3DDEF&first=1994&last=2014";
(*Grab all hyperlinks from search results page*)
searchResults = Import[urlFullPath, "Hyperlinks"];
textLinks =
DeleteDuplicates[
Select[searchResults, StringMatchQ[#, "*.txt"] &]];
numLinks = Length[textLinks];
(*EXTRACT DATA AND PROCESS*)
(*Names of variables about the filing companies*)
companyInfoVarNames = {"CENTRAL INDEX KEY", "FILED AS OF DATE",
"ACCESSION NUMBER", "CONFORMED SUBMISSION TYPE",
"PUBLIC DOCUMENT COUNT", "CONFORMED PERIOD OF REPORT",
"DATE AS OF CHANGE", "EFFECTIVENESS DATE",
"COMPANY CONFORMED NAME", "STANDARD INDUSTRIAL CLASSIFICATION",
"IRS NUMBER", "STATE OF INCORPORATION", "FISCAL YEAR END",
"FORM TYPE", "SEC ACT", "SEC FILE NUMBER",
"FORMER CONFORMED NAME", "DATE OF NAME CHANGE", "STREET 1",
"STREET 2", "CITY", "STATE", "ZIP", "BUSINESS PHONE"};
(* resultsTable contains tables of information extracted from all \
of the filings.
It is created by looping through all of the text links. *)
resultsTable = Table[
(*Grab the HTML for DEF14A*)
(*formDEF14A=Import[theTextLink,
"Plaintext"]; Not used currently *)
formDEF14A = Import[theTextLink];(*Default import*)
formDEF14ATables =
Import[theTextLink, {"HTML",
"Data"}];(*Import while attempting to grab tables*)
(*---Company info---*)
companyInfoStartPos =
Flatten[StringPosition[formDEF14A, "<SEC-HEADER>", 1,
Overlaps -> False]][[1]];
companyInfoEndPos =
Flatten[StringPosition[formDEF14A, "</SEC-HEADER>", 1,
Overlaps -> False]][[1]];
companyInfoRaw =
StringTake[
formDEF14A, {companyInfoStartPos, companyInfoEndPos}];
(*Split company info into an array with variable names and data*)
companyInfoArray =
Select[StringTrim /@
StringSplit[
Select[ReadList[StringToStream[companyInfoRaw], String],
StringMatchQ[#, ___ ~~ ":" ~~ ___] && !
StringMatchQ[#, "<" ~~ ___] &], ":"],
Length[#] == 2 && #[[2]] =!= "" &];
(*companyInfoForThisFiling is a list or row for each filing*)
companyInfoForThisFiling = Table[
companyInfoVarPos =
Position[companyInfoArray,
thisCompanyInfoVar];(*Position of the variable in the array*)
companyInfoVarData =
If[companyInfoVarPos =!= {},
companyInfoArray[[companyInfoVarPos[[1, 1]],
companyInfoVarPos[[1, 2]] + 1]],
"NA"];(*Data extracted for that variable*)
companyInfoVarData = Which[
thisCompanyInfoVar =!= "STANDARD INDUSTRIAL CLASSIFICATION",
companyInfoVarData,
thisCompanyInfoVar == "STANDARD INDUSTRIAL CLASSIFICATION" &&
StringCases[companyInfoVarData,
"[" ~~ __ ~~ "]"] =!= {}, {StringTrim[
StringReplace[companyInfoVarData, "[" ~~ __ ~~ "]" -> ""]],
StringCases[companyInfoVarData,
"[" ~~ sic__ ~~ "]" -> sic]},
thisCompanyInfoVar == "STANDARD INDUSTRIAL CLASSIFICATION" &&
StringCases[companyInfoVarData,
"[" ~~ __ ~~ "]"] == {}, {"NA", "NA"}
];
companyInfoVarData,
{thisCompanyInfoVar,
companyInfoVarNames}];(*End of processing the company info for \
this filing*)
(*---Ownership info---*)
(*Work in progress*)
ownershipTablePos =
StringPosition[formDEF14A,
"SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND \
MANAGEMENT"];
(*Each of the numLinks rows of resultsTable contains the \
following*)
{
formDEF14A,(*The default import of DEF14A in case it needs to be \
examined*)
companyInfoArray,(*Raw array of company info for debugging;
can be removed later*)
Flatten[companyInfoForThisFiling],(*A neat table of company \
information*)
ownershipTablePos,(*A stab at the position of the ownership \
table if it exists [in progress]*)
formDEF14ATables(*A stab at grabbing any data that is tabular*)
},
{theTextLink, textLinks}];(*End resultsTable*)
(*RESULTS RETURNED BY THE FUNCTION*)
{
resultsTable[[All, 1]](*formDEF14A*),
resultsTable[[All, 2]](*companyInfoArray for debugging*),
resultsTable[[All, 3]](*Company info table*),
resultsTable[[All, 4]](*Ownership table position*),
resultsTable[[All, 5]](*form DEF14A imported as HTML data [
tables]*),
numLinks (*Number of text links that were found on the search \
results page*),
textLinks (*Text links that were found on the search results \
page*)
}
]
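
A minimal usage sketch, assuming the definition above has been evaluated and that the EDGAR search URL still responds. The CIK shown is only a placeholder value, and the names on the left-hand side of the assignment are illustrative labels for the seven returned elements, not part of the function itself.

```mathematica
(* Placeholder CIK; substitute the filer you are interested in *)
cik = "0000320193";

(* The trailing semicolon suppresses the very large output *)
results = processSECDEF14A[cik];

(* Name the seven returned elements; these names are illustrative only *)
{filingTexts, infoArrays, infoTables, ownershipPositions,
   htmlTables, numLinks, textLinks} = results;

(* Inspect small pieces instead of the whole result *)
numLinks
Take[textLinks, Min[3, numLinks]]
Short[infoTables, 5]
```

Looking at `numLinks` and a few entries of `textLinks` first is a cheap way to confirm the search found filings before digging into the much larger text and table elements.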