400th anniversary of Shakespeare's death

Posted 10 years ago
POSTED BY: Vitaliy Kaurov
8 Replies

You are completely right David! Somehow I assumed that was due to the domineering qualities of red:

Red - is a bold color that commands attention! Red gives the impression of seriousness and dignity, represents heat, fire and rage, it is known to escalate the body's metabolism. Red can also signify passion and love. Red promotes excitement and action. It is a bold color that signifies danger, which is why it's used on stop signs. Using too much red should be done with caution because of its domineering qualities. Red is the most powerful of colors. The Psychology of Color

Thanks for noticing it. I've just updated the color names above:

colornames = StringJoin[" ", #, " "] & /@ colornames;

Now it looks just right:

Colors Shakespeare

POSTED BY: Bernat Espigulé
POSTED BY: David Gathercole
POSTED BY: Bernat Espigulé
POSTED BY: Marco Thiel

I took this opportunity to use the Wolfram Language on XML documents, specifically a TEI (http://www.tei-c.org) version of "The Tempest," which can be found at the University of Oxford Text Archive (http://ota.ox.ac.uk).

We start by importing the xml document as an XMLObject.

tempestxml = Import["http://ota.ox.ac.uk/text/5725.xml", "XMLObject"];

After exploring the document a little, I found that I could extract lines by a specific speaker with Cases. Here is how to get all the lines spoken by Prospero:

proslines = 
  Cases[tempestxml, XMLElement["sp", {}, {XMLElement["speaker", _, {"Pros."}], line_}] :> line, Infinity];

I made a WordCloud of those lines, only to discover that we lack Elizabethan stopwords:

WordCloud[DeleteStopwords@ToLowerCase@TextWords[StringRiffle[Flatten[proslines//.XMLElement[_,_,content_]:>content]]]]

First Prospero WordCloud

It's much more satisfying after a minor tweak:

WordCloud[DeleteCases[DeleteStopwords@ToLowerCase@TextWords[StringRiffle[Flatten[
    proslines //. XMLElement[_, _, content_] :> content]]], "thee" | "thou" | "thy"]]

Second Prospero WordCloud
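The tweak above hard-codes three pronouns. Here is a minimal sketch of the same idea with a reusable list. The names `elizabethanStopwords` and `cleanWords` are hypothetical, and the list is only a guess at useful entries, not a vetted stopword set.

```wolfram
(* Hypothetical extra stopwords; extend as needed *)
elizabethanStopwords = {"thee", "thou", "thy", "thine", "hath", "doth"};

(* Lower-case the words, drop the built-in stopwords, then drop the extra ones *)
cleanWords[text_String] := 
  DeleteCases[DeleteStopwords[ToLowerCase[TextWords[text]]], 
   Alternatives @@ elizabethanStopwords];

cleanWords["Thou hast thy wish, and she hath thine"]
```

The result can be passed straight to WordCloud in place of the longer expression.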

Why limit ourselves to one character, though?

linesbychar = First@First@# -> Last /@ # & /@ GatherBy[Cases[tempestxml, 
    XMLElement["sp", {}, {XMLElement["speaker", _, {char_}], line_}] :> char -> line, Infinity], First]; 

Grid[Partition[Column@{First@#, WordCloud[DeleteCases[DeleteStopwords@ToLowerCase@TextWords[
               StringRiffle[Flatten[Last@# //. XMLElement[_, _, content_] :> content]]], 
                    "thee" | "thou" | "thy"]]} & /@ linesbychar, 6], Frame -> All, Alignment -> Left]

WordClouds for all characters

We don't have to limit ourselves to characters; we can also make a WordCloud for each scene. In this document, each scene is contained in a <div>:

scenes = Cases[tempestxml, XMLElement["div", _, div_] :> div, Infinity];
Grid[Partition[Column[{First@#, WordCloud[DeleteCases[DeleteStopwords@ToLowerCase@TextWords[
           StringRiffle[Flatten[Cases[Last@#, XMLElement["ab", _, line_] :> line, Infinity] //. XMLElement[_, _, content_] :> content]]],
             "thee" | "thou" | "thy"]]}] & /@ ((Replace[Flatten[Cases[#, 
                XMLElement["head", _, h_] :> (h //. XMLElement[_, _, content_] :> content)]], {s_String} :> s] -> Rest@#) & /@ scenes), 5],                 
                    Frame -> All, Alignment -> Left]

a WordCloud for each scene

POSTED BY: Aaron Enright

Dear @Diego Zviovich and @Vitaliy Kaurov,

I did not have much time last night so I only did some quite basic things. Here are some more ideas. To put everything into a historical context, we might want to look at important events in Shakespeare's life. There is a website (actually there are zillions of them) which has the data in an easy-to-read form:

TimelinePlot[
 Association[{#[[2]] -> Interpreter["Date"][#[[1]]]} & /@ (StringSplit[#, "   "] & /@ 
 StringSplit[StringSplit[StringSplit[Import["http://www.shmoop.com/william-shakespeare/timeline.html", "Plaintext"], "How It All Went Down"][[2]], "BACK NEXT"][[1]], "\n"][[2 ;; ;; 3]])]] 

enter image description here

On the website there are little snippets of text that explain what happened. It is certainly possible to display them using Tooltip in this TimelinePlot. I also wondered where all the plays of Shakespeare were set. Another website contains the information. I use Interpreter to get the GeoCoordinates. It does not always appear to work. Some dots are in Australia and the US; here I restrict the plot to Europe and the Middle East.

places = Import["http://www.nosweatshakespeare.com/shakespeares-plays/shakespeares-play-locations/", "Data"][[2 ;;, 1, 1, 1, 2]][[1, All, -1]];
gpscoords = Interpreter["Location"][places];
GeoListPlot[Select[gpscoords, Head[#] === GeoPosition &], GeoRange -> GeoBoundingBox[{GeoPosition[{59.64927428005451, \
-22.259507086895418`}], GeoPosition[{26.793037464663843`, 48.84842129323249}]}], GeoBackground -> "ReliefMap", GeoProjection -> "Mercator", ImageSize -> Large]

enter image description here

Syllables and Meter

Ok. Now the next bits are a bit more complicated. I first wondered how many syllables there are per verse in the sonnets. Luckily the Wolfram Language has a function for that. But first I set everything up as above and look at the first sonnet.

shakespeare = Import["http://www.gutenberg.org/cache/epub/100/pg100.txt"];
titles = (StringSplit[#, "\n"] & /@ StringTake[StringSplit[shakespeare, "by William Shakespeare"][[1 ;;]], -45])[[;; -3, -1]];
textssplit = (StringSplit[shakespeare, "by William Shakespeare"][[2 ;; -2]]);
alltexts = 
  Table[{titles[[i]], 
    StringDelete[textssplit[[i]], 
     "<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
          SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND 
          IS PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
          WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE
          DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
          PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
          COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
          SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>"]}, {i, 1, Length[titles]}];
allsonnets = StringSplit[StringSplit[alltexts[[1, 2]], "THE END"][[1]], Reverse@(ToString /@ Range[154])][[2 ;;]];
allsonnets[[1]]

enter image description here

The Wolfram Language has WordData built in, so I tried using the "Hyphenation" property to count the syllables - here for sonnet number 4.

WordData[ToLowerCase[#], "Hyphenation"] & /@ # & /@ (TextWords[#] & /@DeleteCases[StringDelete[StringSplit[allsonnets[[4]], "\n"], ","],""][[1 ;; -2]])

enter image description here

The problem is that too many words cannot be hyphenated automatically. It turns out that Wolfram|Alpha does a better job, as discussed here. From there I also take the following function "syllables", which submits a query to Wolfram|Alpha:

ClearAll@syllables;
SetAttributes[syllables, Listable];
syllables[word_String] := Length@WolframAlpha["syllables " <> word, {{"Hyphenation:WordData", 1}, "ComputableData"}]
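For quick experiments without a network connection, a rough local estimate can stand in for the query. The sketch below is not the Wolfram|Alpha method used in this post. It just counts runs of consecutive vowels, and `roughSyllables` is a hypothetical name.

```wolfram
(* Crude offline estimate: count runs of consecutive vowels (y included).
   Assumption: this is a stand-in heuristic only; it over-counts silent e's. *)
ClearAll@roughSyllables;
SetAttributes[roughSyllables, Listable];
roughSyllables[word_String] := 
  Max[1, StringCount[ToLowerCase[word], RegularExpression["[aeiouy]+"]]]

roughSyllables[{"spend", "legacy", "abuse"}]
(* {1, 3, 3}; "abuse" is over-counted because of its silent e *)
```

It is only useful as a sanity check against the Wolfram|Alpha counts.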

With that function we can analyse the verses of sonnet number 4:

Monitor[sonnet4 = 
  Table[{#, syllables[#]} & /@ (TextWords[#] & /@ 
       DeleteCases[StringDelete[StringSplit[allsonnets[[4]], "\n"], ","], ""][[1 ;; -2]])[[k]], {k, 1, Length[DeleteCases[
       StringDelete[StringSplit[allsonnets[[4]], "\n"], ","], ""][[1 ;; -2]]]}], k]

This gives:

TableForm [Reverse /@ # & /@ sonnet4]

enter image description here

where the numbers above the words indicate the estimated number of syllables. Note that there is sometimes a difference between that number and the perceived number of syllables when you speak. Also, some words were probably pronounced quite differently in Shakespeare's time. Here is the number of syllables per verse:

Total /@ sonnet4[[All, All, 2]]
(*{8, 10, 10, 10, 11, 11, 8, 10, 10, 10, 10, 10, 10, 10}*)

Let's do that for one more sonnet:

Monitor[sonnet5 = 
  Table[{#, syllables[#]} & /@ (TextWords[#] & /@ 
       DeleteCases[StringDelete[StringSplit[allsonnets[[5]], "\n"], ","], ""][[1 ;; -2]])[[k]], {k, 1, 
    Length[DeleteCases[StringDelete[StringSplit[allsonnets[[5]], "\n"], ","], ""][[1 ;; -2]]]}], k]

This gives:

TableForm [Reverse /@ # & /@ sonnet5]

enter image description here

There are obviously some problems, such as "o'er-snowed", but overall I am quite impressed with this result. Here is the syllable count per verse:

Total /@ sonnet5[[All, All, 2]]
(*{9, 10, 9, 10, 8, 10, 10, 9, 10, 11, 10, 10, 10, 10}*)

We can now plot and compare the counts for the two sonnets.

ListLinePlot[{Total /@ sonnet4[[All, All, 2]], 
  Total /@ sonnet5[[All, All, 2]]}, PlotRange -> {All, {0, 12}}, LabelStyle -> Directive[Bold, Medium], AxesLabel -> {"verse #", "syllables"}]

enter image description here

In fact, all but three of Shakespeare's sonnets conform to iambic pentameter, which is described here. On that website they say:

Shakespeare's sonnets are written predominantly in a meter called iambic pentameter, a rhyme scheme in which each sonnet line consists of ten syllables. The syllables are divided into five pairs called iambs or iambic feet. An iamb is a metrical unit made up of one unstressed syllable followed by one stressed syllable.

That is not exactly what we get, but we are close.

Rhyme and Meter

We can also try to figure out which verse rhymes with which. To do this we take the last word of every verse (first for sonnet 4):

(TextWords[#] & /@ DeleteCases[StringDelete[StringSplit[allsonnets[[4]], "\n"], ","], ""][[1 ;; -2]])[[All, -1]]

this gives

{"spend", "legacy", "lend", "free", "abuse", "give", "use", "live", "alone", "deceive", "gone", "leave", "thee", "be"}
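The expression that splits a sonnet into verses and words is repeated several times in this post. It could be wrapped in a small helper, sketched below. The name `verseWords` is hypothetical. The helper drops blank lines instead of using `[[1 ;; -2]]`, which assumes that the line dropped in the original code contains only whitespace.

```wolfram
(* Hypothetical helper: split a sonnet string into verses, then into words *)
verseWords[sonnet_String] := 
  TextWords /@ Select[StringTrim /@ StringSplit[sonnet, "\n"], # =!= "" &]

(* Two lines of Sonnet 18 as sample input, with a trailing whitespace-only line *)
sample = "Shall I compare thee to a summer's day?\nThou art more lovely and more temperate:\n  ";
words = verseWords[sample];
Length[words]     (* number of verses *)
words[[All, -1]]  (* last word of each verse *)
```

With it, the last words of sonnet 4 would be `verseWords[allsonnets[[4]]][[All, -1]]`.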

By looking at that list we can immediately see the rhyme scheme, but it would be nice to get this algorithmically. In fact, Wolfram|Alpha again has all we need.

pronounciation = WolframAlpha["IPA spend", {{"Pronunciation:WordData", 1}, "Plaintext"}]

enter image description here

The IPA gives the phonetic transcription, which is what I want. After a little bit of cleaning this is what I get for the first 10 sonnets:

Monitor[rhymingQ = 
  Table[(Quiet[(StringSplit[WolframAlpha["IPA " <> #, {{"Pronunciation:WordData", 1}, "Plaintext"}], {"IPA: ", ")"}][[2]])] /. {"IPA: ",")"} -> Missing["NotAvailable"]) & /@ (TextWords[#] & /@ 
DeleteCases[StringDelete[StringSplit[allsonnets[[l]], "\n"], ","], ""][[1 ;; -2]])[[All, -1]], {l, 1, 10}], l]

This does not always work, but it is ok:

TableForm[# /. Missing["NotAvailable"] -> "NA" & /@ rhymingQ]

enter image description here

We should be able to work with that. I can now convert the last two symbols to their CharacterCode:

endingsounds = Take[ToCharacterCode[#], -2] & /@ (# /. Missing["NotAvailable"] -> "NA" & /@ rhymingQ)[[1]]
{{78, 65}, {97, 618}, {105, 115}, {605, 105}, {618, 122}, {601, 108}, {618, 122}, {601, 108}, {110, 116}, {78, 65}, {110, 116}, {78,65}, {712, 105}, {78, 65}}

Note that I have converted the Missing bits into "NA". I will want to ignore the "NA"s. Next, I look for same endings:

Rule @@@ Flatten /@ 
  Select[Position[endingsounds, #] & /@ DeleteDuplicates[Select[endingsounds, # != {78, 65} &]], Length[#] == 2 &]
(*{5 -> 7, 6 -> 8, 9 -> 11}*)
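The same grouping can be written with PositionIndex, which collects the positions of each distinct ending in one pass. This is only a sketch of an alternative, using the ending codes printed above as input. The names `na` and `groups` are hypothetical.

```wolfram
(* Ending codes for sonnet 1, copied from the output above *)
endingsounds = {{78, 65}, {97, 618}, {105, 115}, {605, 105}, {618, 122}, 
   {601, 108}, {618, 122}, {601, 108}, {110, 116}, {78, 65}, {110, 116}, 
   {78, 65}, {712, 105}, {78, 65}};

na = ToCharacterCode["NA"];  (* {78, 65}, the placeholder for missing data *)

(* Map each ending to the verse numbers where it occurs, ignoring "NA" *)
groups = Values[KeySelect[PositionIndex[endingsounds], # =!= na &]];

(* Keep endings that occur exactly twice and turn them into links *)
Rule @@@ Select[groups, Length[#] == 2 &]
(* {5 -> 7, 6 -> 8, 9 -> 11} *)
```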

This indicates which verse rhymes with which. I have missed some of them because of the "NA"s, but I hope to make up for that by using several sonnets.

alllinks = {}; Do[endingsounds = 
  Take[ToCharacterCode[#], -2] & /@ (# /. Missing["NotAvailable"] -> "NA" & /@ rhymingQ)[[i]]; 
 AppendTo[alllinks, Rule @@@ Flatten /@ Select[Position[endingsounds, #] & /@ 
 DeleteDuplicates[Select[endingsounds, # != {78, 65} &]], Length[#] == 2 &]], {i, 1, 10}]; alllinks = Flatten[alllinks]
{5 -> 7, 6 -> 8, 9 -> 11, 9 -> 11, 10 -> 12, 5 -> 7, 11 -> 13, 1 -> 3, 4 -> 14, 5 -> 7, 6 -> 8, 10 -> 12, 1 -> 3, 9 -> 11, 13 -> 14, 
 2 -> 4, 13 -> 14, 1 -> 3, 5 -> 7, 6 -> 8, 9 -> 11, 10 -> 12, 1 -> 3, 2 -> 4, 5 -> 7, 10 -> 12, 13 -> 14, 1 -> 3, 2 -> 4, 5 -> 7, 10 -> 12,
  13 -> 14}

Not elegant at all, and I see @Vitaliy Kaurov 's despair at these lines of code ... Anyway, it gives a nice graph.

Graph[alllinks, VertexLabels -> "Name", Background -> Black, EdgeStyle -> Yellow, VertexLabelStyle -> Directive[Red, 15]]

enter image description here

This illustrates the structure of the sonnets. If two nodes are usually linked, like 1 and 3, 2 and 4, and 13 and 14, they tend to rhyme. Note that occasionally there are additional links like from 11 to 13 and from 4 to 14.
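To separate the usual links from the occasional ones, the links can simply be counted. The sketch below uses the `alllinks` output printed above. The names `linkcounts` and `frequent` are hypothetical, as is the threshold of "more than once".

```wolfram
(* Links found in the first 10 sonnets, copied from the output above *)
alllinks = {5 -> 7, 6 -> 8, 9 -> 11, 9 -> 11, 10 -> 12, 5 -> 7, 11 -> 13, 
   1 -> 3, 4 -> 14, 5 -> 7, 6 -> 8, 10 -> 12, 1 -> 3, 9 -> 11, 13 -> 14, 
   2 -> 4, 13 -> 14, 1 -> 3, 5 -> 7, 6 -> 8, 9 -> 11, 10 -> 12, 1 -> 3, 
   2 -> 4, 5 -> 7, 10 -> 12, 13 -> 14, 1 -> 3, 2 -> 4, 5 -> 7, 10 -> 12, 
   13 -> 14};

linkcounts = Counts[alllinks];  (* how often each link occurs *)

(* Keep only links seen more than once *)
frequent = Sort[Keys[Select[linkcounts, # > 1 &]]]
(* {1 -> 3, 2 -> 4, 5 -> 7, 6 -> 8, 9 -> 11, 10 -> 12, 13 -> 14} *)
```

The two one-off links, 11 -> 13 and 4 -> 14, drop out.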

On the website referenced above they also say:

There are fourteen lines in a Shakespearean sonnet. The first twelve lines are divided into three quatrains with four lines each. In the three quatrains the poet establishes a theme or problem and then resolves it in the final two lines, called the couplet. The rhyme scheme of the quatrains is abab cdcd efef. The couplet has the rhyme scheme gg.

Our network recovers that exact structure. This rhyme structure distinguishes Shakespeare's style from, for example, Petrarch's style. I have some interest in Petrarch's poems so I might post a comparison later.

I also haven't really looked at the meter. I have only tried to count syllables. But the phonetic transcription does contain information about stress. By combining the two bits of information we might be able to deduce the meter.

Cheers,

M.

POSTED BY: Marco Thiel
Posted 10 years ago

@Marco Thiel, you are indeed fast. Here is my take on the graph of characters, built from the conversations between characters.

data = SemanticImport[
   "c:\\Users\\Diego\\Downloads\\will_play_text.csv", Automatic, 
   "Rows", Delimiters -> ";"];
data[[All, 
    4]] = (ToExpression@StringSplit[#, "."] & /@ 
     data[[All, 4]]) /. {} -> {Missing[Empty], Missing[Empty], 
     Missing[Empty]};
sentiment = 
  Classify["Sentiment", data[[All, 6]]] /. {"Neutral" -> 0, 
    "Positive" -> 1, "Negative" -> -1, Indeterminate -> 0};
data[[All, 6]] = Transpose[{data[[All, 6]], sentiment}];
ds = Dataset[
  AssociationThread[{"ID", "Play", "Phrase", "Act", "Scene", "Line", 
      "Character", "Text", "Sentiment"}, #] & /@ (Flatten /@ data)];
getEdges[play_, act_, scene_] := 
 DirectedEdge[#[[1]], #[[2]]] & /@ 
  Partition[
   ds[Select[#Play == play &] /* Union, {"Act", "Scene", "Phrase", 
       "Character"}][Select[#Act == act && #Scene == scene &], 
     "Character"] // Normal, 2, 1]
getGraph[play_] := 
 Graph[Flatten[
   getEdges[#[[1]], #[[2]], #[[3]]] & /@ (ds[
         Select[#Play == play &] /* Union, {"Play", "Act", "Scene"}] //
         Normal // Values // Most)], VertexLabels -> "Name", 
  ImageSize -> 1024]
getGraph["Othello"]

enter image description here

Well, what about how many movies or TV Shows were made of the most famous plays?

plays={"Macbeth", "Romeo and Juliet", "Othello", "Hamlet", "King Lear", \
"Richard III", "The Tempest", "Merry Wives of Windsor", "Titus \
Andronicus"}
movies[play_] := 
 Import["http://www.imdb.com/find?q=" ~~ 
    StringReplace[play, " " -> "%20"] ~~ 
    "&s=tt&exact=true&ref_=fn_tt_ex", "Data"][[4, 1]]

shakespeareMovies = {Length@movies[#], #} & /@ plays // Sort // 
   Transpose;
BarChart[#[[1]], BarOrigin -> Left, ChartLabels -> #[[2]], 
   Frame -> True, 
   PlotLabel -> "Shakespeare Based Movies"] &@shakespeareMovies

BarChart[#[[1]], BarOrigin -> Left, ChartLabels -> #[[2]], 
   Frame -> True, PlotLabel -> "Shakespeare Based Movies", 
   PlotTheme -> "Detailed", ImageSize -> Large] &@shakespeareMovies

enter image description here

POSTED BY: Diego Zviovich

Dear @Vitaliy Kaurov ,

I haven't had much time to do this but here are a couple of thoughts. It is certainly interesting to use the Wolfram Language to analyse Shakespeare's texts. He is very much on everyone's mind; here are the English-language Wikipedia requests:

data = WolframAlpha[ "Shakespeare", {{"PopularityPod:WikipediaStatsData", 1}, "ComputableData"}];
DateListPlot[data, PlotRange -> All, PlotTheme -> "Detailed", 
 AspectRatio -> 1/4, ImageSize -> 800, PlotLegends -> {"Shakespeare"},Filling -> Bottom]

enter image description here

There is a clear peak every year on 23 April - the anniversary of his death. Of course, there are Wikipedia articles on Shakespeare in many other languages:

WikipediaData["Shakespeare", "LanguagesList"]

enter image description here

Interestingly, we can see that the longest articles are not (!) in English:

wikilength = 
  Select[{#, 
      Quiet[WordCount[
        WikipediaData[Entity["Person", "WilliamShakespeare::s9r82"], 
         "ArticlePlaintext", "Language" -> #]]]} & /@ 
    WikipediaData["Shakespeare", "LanguagesList"], NumberQ[#[[2]]] &];
BarChart[(Reverse@
    SortBy[Append[
      wikilength, {"English", 
       WordCount[
        WikipediaData[Entity["Person", "WilliamShakespeare::s9r82"], 
         "ArticlePlaintext"]]}], Last])[[1 ;; 60, 2]], 
 ChartLabels -> (Rotate[#, Pi/2] & /@ (Reverse@
      SortBy[Append[
         wikilength, {"English", 
          WordCount[
           WikipediaData[
            Entity["Person", "WilliamShakespeare::s9r82"], 
            "ArticlePlaintext"]]}], Last][[All, 1]]))]

enter image description here

I have downloaded the csv/xls version of the collected works you linked to, and then imported the data into Mathematica in the standard way:

texts = Import["/Users/thiel/Desktop/will_play_text.csv.xls"];

There are

Length[texts]

111396 rows in that file. They look like this:

texts[[1 ;; 10]] // TableForm

enter image description here

The first entry in every row is a running counter, followed by the name of the play, two further pieces of information on the position in the text, the speaker, and finally what they say. There are also some stray quotation marks etc. This can be cleaned, and we can look at the most important words that Shakespeare uses (here in his Sonnets):

WordCloud[
 DeleteStopwords[Flatten[TextWords /@ (StringReplace[StringSplit[#, ";"], "\"" -> ""] & /@ texts[[1 ;;, -1]])[[All, -1]]]], IgnoreCase -> True]

enter image description here

We can also count the words in all texts:

Length@Flatten[
  TextWords /@ (StringReplace[StringSplit[#, ";"], "\"" -> ""] & /@ texts[[1 ;;, -1]])[[All, -1]]]

which gives 775418 words. I can also sort the sentences by play like so

byplay = GroupBy[(StringReplace[StringSplit[#, ";"], "\"" -> ""] & /@ texts[[1 ;;, -1]]), #[[2]] &];

No magic here, but it is quite useful if we want to automatically generate a graph of who talks to whom, similar to one of the really cool demonstrations on the Demonstrations Project:

g = Graph[
  DeleteDuplicates@(Rule @@@ 
     Partition[byplay[[18, All, 5]] //. {a___, x_, y_, b___} /; x == y -> {a, x, b}, 2, 1]), VertexLabels -> "Name", 
  VertexLabelStyle -> Directive[Red, Italic, 12], Background -> Black,EdgeStyle -> Yellow, VertexSize -> Small, VertexStyle -> Yellow]

byplay[[18, All, 5]] chooses play 18 - which happens to be Macbeth - and takes the speaker (entry 5) of every sentence. The rule deletes repeated speakers, i.e. if one speaker says several lines in a row I count that as a single instance. The Partition function then pairs each speaker with the next one (assuming that they interact). Finally I delete the duplicates and plot the result; this gives the following graph:
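The collapsing of repeated speakers can also be written with Split, which groups runs of identical neighbours. The sketch below uses a made-up speaker sequence, not data from the file, and the names `speakers`, `collapsed` and `edges` are hypothetical.

```wolfram
(* Toy speaker sequence, for illustration only *)
speakers = {"MACBETH", "MACBETH", "BANQUO", "MACBETH", "MACBETH", "LADY MACBETH"};

(* Split groups runs of identical neighbours; keep one speaker per run *)
collapsed = First /@ Split[speakers]
(* {"MACBETH", "BANQUO", "MACBETH", "LADY MACBETH"} *)

(* Pair each speaker with the next one and drop repeated pairs *)
edges = DeleteDuplicates[Rule @@@ Partition[collapsed, 2, 1]]
(* {"MACBETH" -> "BANQUO", "BANQUO" -> "MACBETH", "MACBETH" -> "LADY MACBETH"} *)
```

On long speaker lists this should be much faster than the repeated rule replacement, since Split makes a single pass.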

enter image description here

It is now trivial to make the same thing for other plays, say Romeo and Juliet (play 28):

g2 = Graph[
  DeleteDuplicates@(Rule @@@ 
     Partition[byplay[[28, All, 5]] //. {a___, x_, y_, b___} /; x == y -> {a, x, b}, 2, 1]), VertexLabels -> "Name", 
  VertexLabelStyle -> Directive[Red, Italic, 12], Background -> Black,EdgeStyle -> Yellow, VertexSize -> Small, VertexStyle -> Yellow]

enter image description here

You can use the following to get a list of all plays:

Partition[Normal[byplay[[1 ;;, 1, 2]]][[All, 2]], 4] // TableForm

enter image description here

We can also use the very handy CommunityGraphPlot feature to look for groups of people in Romeo and Juliet:

g3 = CommunityGraphPlot[
  DeleteDuplicates@(Rule @@@ Partition[byplay[[28, All, 5]] //. {a___, x_, y_, b___} /; x == y -> {a, x, b}, 2, 1]), VertexLabels -> "Name", 
  VertexLabelStyle -> Directive[Red, Italic, 12], Background -> Black,EdgeStyle -> Yellow, VertexSize -> Small, VertexStyle -> Yellow]

enter image description here

Oh, yes, and once we have the graphs we can use PageRank to figure out who the central characters are (for Macbeth):

Grid[Reverse@SortBy[Transpose[{VertexList[g], PageRankCentrality[g]}], Last], Frame -> All]

enter image description here

and here the same for Romeo and Juliet:

Grid[Reverse@SortBy[Transpose[{VertexList[g2], PageRankCentrality[g2]}], Last], Frame -> All]
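To see what the ranking does, here is a sketch on a tiny made-up graph. The vertex names and the names `toy` and `ranking` are hypothetical. Vertex "A" is spoken to by both other vertices, so it should come out on top.

```wolfram
(* Toy directed graph: B and C both talk to A, A talks back to B *)
toy = Graph[{"A" -> "B", "B" -> "A", "C" -> "A"}];

(* Pair each vertex with its PageRank and sort in descending order *)
ranking = Reverse@SortBy[Transpose[{VertexList[toy], PageRankCentrality[toy]}], Last];

ranking[[1, 1]]  (* the most central vertex, expected to be "A" *)
```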

enter image description here

Now I wanted to do something a bit more sophisticated, but the file I downloaded wasn't too good for that. I therefore downloaded the collection of all works of Shakespeare from the Gutenberg project:

shakespeare = Import["http://www.gutenberg.org/cache/epub/100/pg100.txt"];

It contains the following titles:

titles = (StringSplit[#, "\n"] & /@ 
     StringTake[StringSplit[shakespeare, "by William Shakespeare"][[1 ;;]], -45])[[;; -3, -1]];
TableForm[titles]

enter image description here

It is all one long string, so we might want to split it into the different plays etc:

textssplit = (StringSplit[shakespeare, "by William Shakespeare"][[2 ;; -2]]);

I'll next delete a copyright comment and make a list of names of the plays and the corresponding texts.

alltexts = 
  Table[{titles[[i]], 
    StringDelete[textssplit[[i]], 
     "<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
     SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND 
     IS PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
     WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE
     DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
     PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
     COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
     SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>"]}, {i, 1, Length[titles]}];

Ok. Now we have something to work with. Here is a primitive sentiment analysis, similar to the one we used for the GOP presidential debates.

sentiments = 
  Table[{titles[[i]], -"Negative" + "Positive" /. ((Classify["Sentiment", #, "Probabilities"] & /@ #) &@
       Select[TextSentences[alltexts[[i, 2]]], Length[TextWords[#]] > 1 &])}, {i, 1, Length[titles]}];

So we use a bit of machine learning and use the certainty of the algorithm to obtain estimates of sentiments. We should also average over a couple of consecutive sentences, which is why we use MovingAverage:

MovingAverage[sentiments[[4, 2]], 30] // ListLinePlot

enter image description here
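As a small illustration of the smoothing step, here is MovingAverage on a made-up list of sentence scores. The values and the name `scores` are hypothetical. A window of 3 is used instead of 30 so that the result can be checked by hand.

```wolfram
(* Toy sentiment scores: +1 positive, -1 negative, 0 neutral *)
scores = {1, -1, 1, 1, -1, 0};

(* Each entry is the mean of 3 consecutive scores *)
MovingAverage[scores, 3]
(* {1/3, 1/3, 1/3, 0} *)
```

The smoothed list is shorter than the input by the window size minus one.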

Ok. Let's normalise that to a standard length of 1 (i.e. relative position in the text) and make it a bit more appealing:

ArrayReshape[
  ListLinePlot[
     Transpose@{Range[Length[#[[2]]]]/Length[#[[2]]], #[[2]]}, 
     PlotLabel -> #[[1]], Filling -> Axis, ImageSize -> Medium, 
     Epilog -> {Red, 
       Line[{{0, Mean[#[[2]]]}, {1, Mean[#[[2]]]}}]}] & /@ 
   Transpose@{titles, 
     MovingAverage[#, 50] & /@ sentiments[[All, 2]]}, {19, 
   2}] // TableForm

enter image description here

Let's make a couple of WordClouds for the individual plays. First I want an image of Shakespeare:

img = Import[
  "http://i0.wp.com/whatson.london/images/2013/07/Shakespeare.png?w=590"]; mask = Binarize[img, 0.0001];

We can then calculate a WordCloud in the shape of Shakespeare and overlay it on the image of the genius:

cloud = Image[WordCloud[DeleteStopwords[TextWords[alltexts[[2, 2]]]], mask, IgnoreCase -> True]]; 
ImageCompose[cloud, {ImageResize[img, ImageDimensions[cloud]], 0.2}]

enter image description here

Ok. Let's do that for all texts:

Monitor[wordclouds = 
   Table[{alltexts[[k, 1]], 
     cloud = Image[WordCloud[DeleteStopwords[TextWords[alltexts[[k, 2]]]], mask, IgnoreCase -> True]]; 
     ImageCompose[cloud, {ImageResize[img, ImageDimensions[cloud]], 0.2}]}, {k, 1,Length[titles]}];, k]

and plot it:

ArrayReshape[wordclouds[[All, 2]], {19, 2}] // TableForm

enter image description here

We can now also count the words per text and see how many different words Shakespeare uses:

textlength = Transpose[{alltexts[[All, 1]], Length[TextWords[#]] & /@ alltexts[[All, 2]]}];
textvocabulary = Transpose[{alltexts[[All, 1]], Length[DeleteDuplicates[ToLowerCase[DeleteStopwords[TextWords[#]]]]] & /@ alltexts[[All, 2]]}];

Here is a representation of that:

Show[
ListPlot[Table[Tooltip[{textlength[[k, 2]], textvocabulary[[k, 2]]}, textlength[[k, 1]]], {k, 1, Length[textlength]}], 
PlotStyle -> Directive[Red], AxesLabel -> {"Length", "Vocabulary"}, LabelStyle -> Directive[Bold, Medium]], 
Plot[Evaluate@Normal[LinearModelFit[Transpose[{textlength[[All, 1]], textlength[[All, 2]], textvocabulary[[All, 2]]}][[All, {2, 3}]], x, x]], {x, 0,32000}]]

enter image description here
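The fit in the plot above is an ordinary straight-line fit of vocabulary against length. The data expression passed to LinearModelFit there is equivalent to `Transpose[{textlength[[All, 2]], textvocabulary[[All, 2]]}]`. Here is a minimal sketch of the same call on made-up numbers. The values and the names `pairs` and `lm` are hypothetical.

```wolfram
(* Made-up {length, vocabulary} pairs, for illustration only *)
pairs = {{1000, 420}, {2000, 690}, {3000, 1010}, {4000, 1280}};

(* Fit vocabulary = a + b*length *)
lm = LinearModelFit[pairs, x, x];

lm["BestFitParameters"]  (* {intercept, slope} *)
lm[5000]                 (* predicted vocabulary for a text of length 5000 *)
```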

We can also construct something which is like a semantic network. Basically, we delete all Stopwords and build a network of the sequences of remaining words:

Graph[Rule @@@ Partition[DeleteStopwords[TextWords[alltexts[[2, 2]]]], 2, 1][[1 ;; 100]], VertexLabels -> "Name"]

enter image description here

If we want to be really fancy about it, we can do the whole thing in 3D:

Graph3D[Rule @@@ Partition[DeleteStopwords[TextWords[alltexts[[2, 2]]]], 2, 1][[1 ;; 100]], ImageSize -> Large, VertexLabels -> "Name", VertexLabelStyle -> Directive[Red, Italic, 10]]

enter image description here

Well, that's not a lot, but perhaps a starting point.

Cheers,

M.

POSTED BY: Marco Thiel