How many words did Shakespeare know?
There is an eternal question about how many words Shakespeare knew. He was certainly a literary genius, perhaps the greatest one ever, and his vocabulary alone is certainly not a sufficient measure of that, but the question of how many words he knew is still interesting. There is a frequently cited paper that estimates his vocabulary at about 66500 words. The statistics behind their approach are quite involved. The general idea is that Shakespeare did not use all the words he knew in his plays. In every further play he might have used more words, but because he had already used many words in previous texts, the increase becomes slower and slower. Only for an infinitely long text could we learn everything about his vocabulary. I will use an approach that is very different from theirs and much simpler to generate my own estimate. Using the link that @Vitaliy Kaurov provided (see file attached), I import the data:
texts = Import["/Users/thiel/Desktop/will_play_text.csv.xls"];
words = Flatten[TextWords /@ (StringReplace[StringSplit[#, ";"], "\"" -> ""] & /@ texts[[1 ;;, -1]])[[All, -1]]];
words // Length
I find that there are 775418 words in all the plays. Deleting all duplicates, we get:
DeleteDuplicates[words] // Length
31665 words actually used in his work, which is close to what they find in the paper. They also note that they count every distinguishable pattern of letters as a different word, e.g. "girls" is counted as different from "girl", which is basically what the word count above does as well. I do think, however, that we should at least transform everything to lower case:
DeleteDuplicates[ToLowerCase[words]] // Length
This gives 26620 words, so we do not appear to be off by much. Many of these words occur only once:
Select[Tally[ToLowerCase[words]], #[[2]] == 1 &] // Length
14243 of them, in fact, occur exactly once. We can make a little chart to see how many words occur exactly how many times.
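One way to build the helper list wordrepetitions used below is as a tally of the tallies, i.e. pairs of {number of occurrences, number of words with that count}:
wordrepetitions = SortBy[Tally[Tally[ToLowerCase[words]][[All, 2]]], First]; (* {occurrences, number of words occurring that often} *)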
BarChart[wordrepetitions[[All, 2]], ChartLabels -> wordrepetitions[[All, 1]], AxesLabel -> {"# of occurrences", "# words"}]

Here's the idea. I need to figure out how many words there would be in an "infinitely long text" by Shakespeare. Needless to say, I don't have one. So let's assume that there was such a text, and that the "real" texts Shakespeare wrote are part of it. Let's also assume that I get all the words in his texts in chunks of 5000 or so. I will find a certain number of unique words in the first chunk, then some more when I get the second chunk, and so on. Let's ask Mathematica to do that for us:
Monitor[knownwords =
Table[{M, N[Length[DeleteDuplicates[ToLowerCase@words[[1 ;; M]]]]]}, {M, 100, Length[words], 5000}];, M];
ListPlot[knownwords, AxesLabel -> {"Words in known texts", "Unique words"}, LabelStyle -> Directive[Bold, Medium], ImageSize -> Large]

As expected, the number of unique words increases with the length of the text we have from him, but it appears to saturate somewhere. Our question now is: at what value does it saturate?
My first thought was to fit something like the charging curve of a capacitor: thinking about Poisson processes etc., we could expect something proportional to $1-e^{-b x}$. The thing is that this does not give a very good fit.
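Concretely, such a fit could be set up like this (the starting values are just a rough guess):
nlmfirst = NonlinearModelFit[knownwords, a (1. - Exp[-b x]), {{a, 30000}, {b, 0.00001}}, x] (* simple capacitor-style saturation model *)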
Show[ListPlot[knownwords], Plot[nlmfirst[x], {x, 0, Length[words]}, PlotStyle -> Red], Frame -> True]

So I instead use a slight generalisation of my "model":
nlm = NonlinearModelFit[knownwords, a (1. - Exp[b x^c]), {{a, 40000}, {b, -0.001}, {c, 0.6}}, x]
This evaluates to:
nlm[x]

This looks much better:
Show[ListPlot[knownwords, LabelStyle -> Directive[Bold, Medium], ImageSize -> Large],
Plot[nlm[x], {x, 0, Length[words]}, PlotStyle -> Red], Frame -> True]

The leading parameter of the fitted function tells us that Shakespeare's estimated vocabulary would consist of about (haha!!!) 54973 words. We can get this by evaluating the fitted function for an "infinitely long" text:
Limit[nlm[x], x -> Infinity]
(*54973.1*)
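The limit is simply the parameter $a$: since the fitted $b$ is negative and $c>0$, the term $e^{b x^c}$ vanishes as $x \to \infty$, so $\lim_{x \to \infty} a\,(1-e^{b x^c}) = a$.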
This is somewhat lower than the 66500 they give, but it is in the right ballpark, and it is not necessarily a worse estimate. I can also look at the parameter confidence intervals
nlm["ParameterConfidenceIntervals"]
to get
{{51606.7, 58339.5}, {-0.000142569, -0.000126538}, {0.616614, 0.63681}}
So somewhere between 51600 and 58340 words. It is interesting to think about the parameter $c$ and what it might mean. It appears that the number of new words increases more slowly than with $\log(b x)$. The fit is quite good, so perhaps some interpretation of this heuristic model parameter could be found. For example, if the topics of the texts were somewhat limited (e.g. love and hate, war and peace, and everything that moves people), it could be that we do not explore the entire vocabulary, or do so more slowly. Perhaps he did know things, say about science, that he did not fully discuss in his plays. See also here and here. This is of course very wild speculation, devoid of any evidence whatsoever.
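One small, purely mathematical remark on $c$: for short texts, where $|b|\,x^c \ll 1$, the model behaves like $a\,(1-e^{b x^c}) \approx -a\,b\,x^c$, so the early vocabulary growth is roughly a power law with exponent $c \approx 0.63$.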
Just to make sure, I wanted to do the same thing with the Project Gutenberg text, which contains slightly different material. As above:
shakespeare =
Import["http://www.gutenberg.org/cache/epub/100/pg100.txt"];
titles = (StringSplit[#, "\n"] & /@
StringTake[
StringSplit[shakespeare,
"by William Shakespeare"][[1 ;;]], -45])[[;; -3, -1]];
textssplit = (StringSplit[shakespeare,
"by William Shakespeare"][[2 ;; -2]]);
alltexts =
Table[{titles[[i]],
StringDelete[textssplit[[i]],
"<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE
DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY. PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>"]}, {i, 1, Length[titles]}];
This contains
wordsGBP = Flatten[TextWords /@ Flatten[alltexts[[All, 2]]]];
Dimensions[wordsGBP]
899351 words - so it is a bit longer. Same thing as before:
Monitor[knownwordsGBP =
Table[{M, N[Length[DeleteDuplicates[ToLowerCase@wordsGBP[[1 ;; M]]]]]}, {M,100, Length[wordsGBP], 5000}];, M]
nlmGBP = NonlinearModelFit[knownwordsGBP, a (1. - Exp[b x^c]), {{a, 40000}, {b, -0.001}, {c, 0.6}}, x]
gives

so 53181 words, which is quite close to our first estimate.
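As before, this number is just the saturation value of the fit:
Limit[nlmGBP[x], x -> Infinity]
(* about 53181 *)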
Show[ListPlot[knownwordsGBP], Plot[nlmGBP[x], {x, 0, Length[wordsGBP]}, PlotStyle -> Red], Frame -> True]

As a final check, we can see what happens if we do not convert everything to lower case - probably closer to what the authors of the paper did.
Monitor[knownwordsUPPERcase =
Table[{M, N[Length[DeleteDuplicates[words[[1 ;; M]]]]]}, {M, 100, Length[words], 5000}];, M]
nlmUPPERcase =
NonlinearModelFit[knownwordsUPPERcase, a (1. - Exp[b x^c]), {{a, 40000}, {b, -0.001}, {c, 0.6}}, x]
This gives

and hence an estimate of 70736 words. If I do the same thing with the Project Gutenberg text I get an estimate of 69051 words, which is quite close to what they find in the paper.
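That Gutenberg fit without the lower-case conversion is built in exactly the same way (shown here for completeness, with the same model and starting values as before):
Monitor[knownwordsUPPERcaseGBP =
Table[{M, N[Length[DeleteDuplicates[wordsGBP[[1 ;; M]]]]]}, {M, 100, Length[wordsGBP], 5000}];, M]
nlmUPPERcaseGBP =
NonlinearModelFit[knownwordsUPPERcaseGBP, a (1. - Exp[b x^c]), {{a, 40000}, {b, -0.001}, {c, 0.6}}, x]
The confidence intervals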
nlmUPPERcaseGBP["ParameterConfidenceIntervals"]
(*{{65319.6, 72782.7}, {-0.0000898997, -0.0000789692}, {0.647895, 0.667528}}*)
now contain the estimate given in the paper.
If you read their paper you will find that they use much more sophisticated arguments, and I find it surprising and reassuring that this simple approach gives consistent estimates. In fact, assuming that words in upper and lower case are still the same words, my lower estimates seem to be the more plausible ones.
So in conclusion, Shakespeare's vocabulary was probably around 55000 words, which is about 10000 words lower than the often-cited estimate. Of course, you might - with quite some good reason - say that all of this is voodoo rather than reliable science.
Also, Shakespeare's genius is certainly not diminished if his vocabulary was 20% smaller - it is still huge. His art is much more than just the number of words. There is a beautiful article about the "Unholy Trinity" in Macbeth - the symbolism behind the number three. I have some results on that; if anyone cares, I can post them, too.
Cheers,
M.
PS: I am sorry for the very speculative parts...