Hi Young,
It looks like there are issues with the code in "Notebook-9-3-Conservative Range COPY.nb": several cells failed to evaluate (I did not evaluate any cells myself; the errors were already present in the notebook). I also have a hard time following the code in the notebook.
Some expressions have much simpler implementations. For example,
StringReplace[#, (StartOfString ~~ Whitespace) | (Whitespace ~~ EndOfString) :> ""] & /@ data0;
is the same as the following, assuming you want to remove all whitespace from the beginning and end of each string:
StringTrim@data0
Some parts of the code look incorrect:
y2019 = Length[StringPosition[data2, "2019"]];
y2018 = Length[StringPosition[data2, "2018"]];
y2017 = Length[StringPosition[data2, "2017"]];
y2016 = Length[StringPosition[data2, "2016"]];
y2015 = Length[StringPosition[data2, "2015"]];
y2014 = Length[StringPosition[data2, "2014"]];
y2013 = Length[StringPosition[data2, "2013"]];
y2012 = Length[StringPosition[data2, "2012"]];
y2011 = Length[StringPosition[data2, "2011"]];
y2010 = Length[StringPosition[data2, "2010"]];
Since data2 is a list of strings, StringPosition is going to return a list for every string (an empty list for no match). The Length is going to be identical in all cases, so this is not going to give the right fiscal year:
fiscalyear =
Max[y2019, y2018, y2017, y2016, y2015, y2014, y2013, y2012, y2011, y2010];
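To make the problem concrete, here is a small sketch on a made-up list (sample is hypothetical and only stands in for data2). It shows that Length of the StringPosition result counts the strings rather than the matches, and one possible way to count actual occurrences with StringCount instead.

```wolfram
(* Hypothetical stand-in for data2 *)
sample = {"FY 2018", "FY 2019", "no year"};

(* One sublist per input string, empty where there is no match *)
StringPosition[sample, "2019"]
(* {{}, {{4, 7}}, {}} *)

(* Length counts the strings, not the matches, so it is the same for any year *)
Length[StringPosition[sample, "2019"]]  (* 3 *)
Length[StringPosition[sample, "2010"]]  (* 3 *)

(* Counting actual occurrences across all strings *)
Total[StringCount[sample, "2019"]]  (* 1 *)
Total[StringCount[sample, "2010"]]  (* 0 *)
```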
Rather than enumerating every year, why not search for a pattern such as the following?
"20"~~DigitCharacter~~DigitCharacter
It is also not clear why Max is used to determine the fiscal year. StringPosition is going to return the position within each string, not the position in the entire document.
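A minimal sketch of the pattern-based approach, assuming the intent was to pick the year that is mentioned most often. The list sample and the name fiscalYearGuess are hypothetical. Note that this pattern would also match inside longer digit runs, so real data may need a stricter pattern.

```wolfram
(* Hypothetical stand-in for data2 *)
sample = {"Fiscal year ended December 31, 2018", "Filed in 2019",
   "Results for fiscal 2018", "no year here"};

(* Collect every substring of the form 20dd from all strings *)
years = Flatten[StringCases[sample, "20" ~~ DigitCharacter ~~ DigitCharacter]]
(* {"2018", "2019", "2018"} *)

(* Tally the matches per year *)
Counts[years]
(* <|"2018" -> 2, "2019" -> 1|> *)

(* Most frequently mentioned year; assumes at least one match was found *)
fiscalYearGuess = First[Commonest[years]]
(* "2018" *)
```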
When the following is evaluated, only a and b are defined, so Max will not evaluate to a number. The comparison therefore stays symbolic, and If will not evaluate either:
truename =
 If[a == Max[a, b, c, d, e, f, g, h, i, j, k],
  "name of registrant as specified in its charter)",
  If[b == Max[a, b, c, d, e, f, g, h, i, j, k],
   "name of registrant as specified in its certificate)"]];
The rest of the code has even more complex logic that looks incorrect and there are no comments in the code.
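Going back to the If expression above, here is a small sketch of what happens when one of the symbols is undefined. The values 3 and 5 are made up for illustration; only the behavior of Max and If matters.

```wolfram
ClearAll[a, b, c];
a = 3; b = 5;   (* c is deliberately left undefined *)

(* Max cannot reduce to a number while c is symbolic *)
Max[a, b, c]

(* The test is neither True nor False, so If is returned unevaluated *)
If[a == Max[a, b, c], "first", "second"]

(* A fourth argument to If is returned when the test is undetermined *)
If[a == Max[a, b, c], "first", "second", "undetermined"]
(* "undetermined" *)
```

Using the four-argument form of If, or making sure every symbol has a value first, would at least make this kind of failure visible.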
Anyway, assuming you are able to get your code to work correctly, wrap it in a function that accepts the proxy URL and returns the desired result. Here is an example that just returns an association of word and sentence counts:
processProxyURL[proxyURL_] :=
 Module[{plainText, allWords, nonStopWords, sentences},
  (* Import the document at the URL as plain text *)
  plainText = Import[proxyURL, "Plaintext"];
  allWords = TextWords@plainText;
  nonStopWords = DeleteStopwords@allWords;
  sentences = TextSentences@plainText;
  (* Return the counts as an association *)
  <|"Words" -> Length@allWords,
   "Non Stopwords" -> Length@nonStopWords,
   "Sentences" -> Length@sentences|>
  ]
To run it on every proxy URL in the dataset and add the result as a new column:
dataWithProxyStatementURL[All, <|#, "Processed Data" -> processProxyURL[#["proxyStatementURL"]]|> &]