400th anniversary of Shakespeare's death

Posted 10 years ago

NOTE: the actual app that does some analysis of Shakespeare's "Romeo and Juliet" is located HERE. Please be patient with any loading or evaluation time, it is computing! ;-) Read below and through the comments for many ideas on mining Shakespeare's data.

enter image description here

I also highly recommend reading the recent related blog post by Jofre Espigule and checking out his Wolfram Cloud app, which produces social and linguistic visualizations of Shakespeare's texts:

enter image description here

April 23, 2016 marks the 400th anniversary of Shakespeare's death. Just a few decades of his life's work produced texts that have fascinated humanity for 400 years. This centuries-old fascination tells us that Shakespeare's works capture perennial social and cultural phenomena, and that Shakespeare is a rare genius, a true master of the written and spoken word. But have you ever thought of Shakespeare's texts as data? Perhaps the emerging field of digital humanities can tell us what to read between the lines. Modern technologies can provide new insight into the social networks of characters and into the semantic, statistical and other properties of a corpus that is usually valued only for its artistic merit. Is there a pattern in the art?

Can you think of data-mining analyses or visualizations to apply to Shakespeare's works? Please share your thoughts! Dive with Wolfram technologies into the infinite depths of Shakespeare's data.

EXAMPLE: Storyline

Imagine I would like to see, in a few quick pictures, how the drama develops through a play. I take "Romeo and Juliet" and download the full text as a string (lower-casing all words):

romeojuliet = ToLowerCase[Import["http://shakespeare.mit.edu/romeo_juliet/full.html"]];

Now I will write a function drama that displays the density of each given keyword across the play. It indexes the positions of each keyword in the text and passes them to SmoothHistogram, which runs the SmoothKernelDistribution algorithm internally and plots the resulting density:

drama[keywords_List] := With[
  {pos = StringPosition[romeojuliet, #][[All, 1]] & /@ keywords},
  SmoothHistogram[pos,
   Frame -> None, BaseStyle -> White,
   PlotLegends -> Placed[keywords, {{.93, .8}}],
   AspectRatio -> 1/3, ImageSize -> 700, PlotTheme -> "Marketing",
   PlotStyle -> {Automatic, Automatic, Dashed, Dashed, Dashed},
   Filling -> {1 -> {2}}, FillingStyle -> Directive[White, Opacity[.8]]]]
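To make the indexing step concrete, here is a minimal sketch on a made-up sentence (the string toy and the names positions and relative are assumptions for illustration, not text or code from the app). It shows the raw start offsets that drama hands to SmoothHistogram, and how dividing by the text length turns them into a relative position between 0 and 1.

```wolfram
(* Toy text standing in for the play; not Shakespeare's words *)
toy = "love and hate, then love again; death ends love";

(* Start offset of every occurrence of each keyword, as in drama *)
positions = StringPosition[toy, #][[All, 1]] & /@ {"love", "death"}
(* {{1, 21, 44}, {33}} *)

(* Optional: rescale offsets to a 0-1 "time through the text" axis *)
relative = N[positions/StringLength[toy]]
```

Keep in mind that StringPosition matches substrings, so a keyword like "love" is also counted inside longer words such as "lovely" or "beloved".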

And now, with a few computations, drama reads the play and announces the verdict in just 3 images. We can see clearly what was important as time went by. The 3rd image, showing the interplay between "love", "hate", "life", and "death", speaks the most.

drama[{"romeo", "juliet", "life", "death"}]
drama[{"romeo", "juliet", "love", "hate"}]
drama[{"love", "hate", "life", "death"}]

enter image description here

To make a cloud app, we need to modify the function a bit and use CloudDeploy.

dramaFORM[keywords_String] := Rasterize@Module[
   {pos, leg, keys = TextWords[ToLowerCase[keywords]]},
   leg = {"romeo", "juliet"}~Join~keys;
   pos = StringPosition[romeojuliet, #][[All, 1]] & /@ leg;
   SmoothHistogram[DeleteCases[pos, {} | {_Integer}],
    Frame -> None, PlotLegends -> Placed[leg, Bottom],
    AspectRatio -> 1/3, ImageSize -> 700, PlotTheme -> "Marketing",
    PlotStyle -> {Automatic, Automatic}~Join~Table[Dashed, {Length[leg] - 2}],
    Filling -> {1 -> {2}}, FillingStyle -> Directive[White, Opacity[.8]]]]

CloudDeploy[FormFunction[{
    "x" -> <|"Label" -> "", 
    "Interpreter"->"String",
    "Hint"->"hint: love, death",
    "Help"->Style["type Shakespeare's words separated by spaces or comma, be patient, wait, behold ;-)",Italic]|>}, 
    dramaFORM[#x]&,
    AppearanceRules-><|
    "Title" -> Grid[{{"Evolution of topics through Romeo & Juliet"},{Spacer[{10,5}]},{img}},Alignment->Center], 
    "Description" -> "DETAILS:  http://wolfr.am/RomeoJuliet "|>,
    FormTheme -> "Black"],
"RomeoAndJuliet",   
Permissions->"Public"]
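One detail that may be worth checking in dramaFORM: DeleteCases drops keywords that occur fewer than two times, but the legend is still built from the full list leg, so as far as I can tell the labels can drift out of step with the curves. A possible remedy, sketched here on a made-up string (toy, pairs, kept, keptLeg and keptPos are hypothetical names, not part of the deployed app), is to filter keywords and positions together:

```wolfram
(* Made-up text and keyword list for illustration only *)
toy = "romeo loves juliet; juliet loves romeo; o romeo";
leg = {"romeo", "juliet", "tybalt", "o romeo"};

(* Pair every keyword with its start offsets *)
pairs = {#, StringPosition[toy, #][[All, 1]]} & /@ leg;

(* Keep only keywords with at least two occurrences, so a density can be estimated *)
kept = Select[pairs, Length[Last[#]] >= 2 &];

keptLeg = kept[[All, 1]]  (* {"romeo", "juliet"} *)
keptPos = kept[[All, 2]]  (* {{1, 34, 43}, {13, 21}} *)
```

With this, keptPos would go to SmoothHistogram and keptLeg to PlotLegends, and the Dashed styles could be counted from Length[keptLeg] instead of Length[leg].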

EXAMPLE: Wordcloud

It is also interesting to know how modern society sees Shakespeare. The code below builds a word cloud from the Encyclopedia Britannica article about Shakespeare.

text=Import["http://www.britannica.com/print/article/537853"];
base[w_]:=With[{tmp=WordData[w,"BaseForm","List"]}, If[(Head[tmp]===Missing)||tmp==={},w,tmp[[1]]]];
SetAttributes[base,Listable];
tst=Quiet[base[TextWords[StringDelete[DeleteStopwords[ToLowerCase[text]],DigitCharacter..]]]];
blackLIST={"shakespeare","william","th","iii","iv","vi"};
WordCloud[DeleteCases[DeleteCases[tst,_First],Alternatives@@blackLIST],
    WordOrientation->{{-\[Pi]/4,\[Pi]/4}},AspectRatio->1/3,
    ScalingFunctions->(#^.01&),ImageSize->800]

enter image description here

DATA & CODE SOURCES:

POSTED BY: Vitaliy Kaurov
8 Replies

Here is my little contribution which is based on Chris Wilson's Wolfram Summer School side project about "Book Colors".

  • Gather a list of Color Names:

    colornames = {" alice blue ", " antique white ", " aqua ", " aquamarine ", " azure ", " beige ", " bisque ", " black ", " blanched almond ", " blue ", " blue violet ", " brown ", " burly wood ", " cadet blue ", " chartreuse ", " chocolate ", " coral ", " cornflower blue ", " cornsilk ", " crimson ", " cyan ", " dark blue ", " dark cyan ", " dark golden rod ", " dark gray ", " dark green ", " dark khaki ", " dark magenta ", " dark olive green ", " dark orange ", " dark orchid ", " dark red ", " dark salmon ", " dark sea green ", " dark slate blue ", " dark slate gray ", " dark turquoise ", " dark violet ", " deep pink ", " deep sky blue ", " dim gray ", " dodger blue ", " fire brick ", " floral white ", " forest green ", " fuchsia ", " gainsboro ", " ghost white ", " gold ", " golden rod ", " gray ", " green ", " green yellow ", " honey dew ", " hot pink ", " indian red ", " indigo ", " ivory ", " khaki ", " lavender ", " lavender blush ", " lawn green ", " lemon chiffon ", " light blue ", " light coral ", " light cyan ", " light gray ", " light green ", " light pink ", " light salmon ", " light sea green ", " light skyblue ", " light slate gray ", " light steel blue ", " light yellow ", " lime ", " lime green ", " linen ", " magenta ", " maroon ", " medium blue ", " medium orchid ", " medium purple ", " medium sea green ", " medium slate blue ", " medium spring green ", " mediumturquoise ", " medium violetred ", " midnight blue ", " mint cream ", " misty rose ", " moccasin ", " navajo white ", " navy ", " old lace ", " olive ", " olive drab ", " orange ", " orange red ", " orchid ", " pale golden rod ", " pale green ", " pale turquoise ", " pale violet red ", " papaya whip ", " peach puff ", " peru ", " pink ", " plum ", " powder blue ", " purple ", " red ", " rosy brown ", " royal blue ", " saddle brown ", " salmon ", " sandy brown ", " sea green ", " sea shell ", " sienna ", " silver ", " sky blue ", " slate blue ", " slate gray ", " snow ", " spring green ", 
" steel blue ", " teal ", " thistle ", " tomato ", " turquoise ", " violet ", " wheat ", " white ", " white smoke ", " yellow ", " yellow green "};
    
  • Use the color Interpreter to get them in the WL:

    colors = Interpreter["Color"][colornames]
    

colors

  • Make a PieChart of the StringCounts of these color names:

    Column@Table[
    link="http://shakespeare.mit.edu/"<>works<>"/full.html";
    text=ToLowerCase[Import[link]];
    PieChart[
    ParallelMap[StringCount[text,#]&,colornames],
    ChartStyle->colors,
    PlotLabel->Hyperlink[TextCases[text,"Line",1],link],
    LabelingFunction->"RadialCenter",
    ChartLabels->Placed[colornames,"RadialCallout"],
    ImageSize->600],
    {works,StringSplit[#,"/"][[-2]]&/@Import["http://shakespeare.mit.edu/","Hyperlinks"][[3;;-8]]}]
    

Enjoy!

POSTED BY: Bernat Espigulé

This is a fun analysis Bernat, but I worry about the colossal dominance of red! Is there really so much blood? This leads me to suspect we might be picking up the word red inside other words, murdered springs to mind!

Rather than counting occurrences of the sequence "r e d", Mathematica lets us examine the characters around it. Presume that a colour is only intended if it is not immediately preceded or followed by another letter. We lose some good data (i.e., plurals like "reds"), but we filter out cases like "murdered" above.

colourOccuranceTable = ParallelTable[
    text = ToLowerCase[Import[ "http://shakespeare.mit.edu/" <> works <> "/full.html"]];
    Count[ Nor @@ LetterQ /@ # & /@
        Characters[ StringDrop[ StringCases[text, _ ~~ # ~~ _], {2, 1 + StringLength[#]}]]
        ,True] & /@ colornames
    ,{works, (StringSplit[#, "/"][[-2]] & /@ Import["http://shakespeare.mit.edu/", "Hyperlinks"][[3 ;; -8]])}];

We can then sum this data up over all the works, and compare it to your original analysis.

Comparison

Red is looking slightly less dominant!
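For comparison, a minimal sketch of the same idea using the built-in WordBoundary string pattern, which should express "not touching another letter" more directly. The sample sentence is made up, and this is an alternative formulation rather than the code used for the comparison above.

```wolfram
(* Made-up sentence with "red" inside a word, on its own, and as a plural *)
sample = "he murdered the red rose; the reds faded";

(* Plain substring counting picks up "murdered" and "reds" too *)
naive = StringCount[sample, "red"]
(* 3 *)

(* Requiring a word boundary on both sides keeps only the standalone word *)
strict = StringCount[sample, WordBoundary ~~ "red" ~~ WordBoundary]
(* 1 *)
```

As with the letter test, plurals such as "reds" are lost, so the strict count is a lower bound.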

POSTED BY: David Gathercole
POSTED BY: Bernat Espigulé
POSTED BY: Marco Thiel

I took this opportunity to use the Wolfram Language on XML documents, specifically a TEI (http://www.tei-c.org) version of "The Tempest," which can be found at the University of Oxford Text Archive (http://ota.ox.ac.uk).

We start by importing the XML document as an XMLObject.

tempestxml = Import["http://ota.ox.ac.uk/text/5725.xml", "XMLObject"];

After exploring the document a little, I found that I could extract lines by a specific speaker with Cases. Here is how to get all the lines spoken by Prospero:

proslines = 
  Cases[tempestxml, XMLElement["sp", {}, {XMLElement["speaker", _, {"Pros."}], line_}] :> line, Infinity];
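To show what that pattern does without downloading anything, here is a small sketch on hand-built symbolic XML. The element layout (sp, speaker, ab) is assumed to mimic the TEI file, and the two lines of text are made up.

```wolfram
(* Hand-built symbolic XML mimicking the sp/speaker/ab layout; toy content *)
toyxml = XMLElement["body", {}, {
    XMLElement["sp", {}, {XMLElement["speaker", {}, {"Pros."}],
      XMLElement["ab", {}, {"first toy line"}]}],
    XMLElement["sp", {}, {XMLElement["speaker", {}, {"Mir."}],
      XMLElement["ab", {}, {"second toy line"}]}]}];

(* Same pattern as above: keep the element that follows the speaker tag *)
toylines = Cases[toyxml,
   XMLElement["sp", {}, {XMLElement["speaker", _, {"Pros."}], line_}] :> line,
   Infinity]
(* {XMLElement["ab", {}, {"first toy line"}]} *)

(* Strip the remaining markup down to plain strings *)
plain = Flatten[toylines //. XMLElement[_, _, content_] :> content]
(* {"first toy line"} *)
```

Note that the pattern only matches sp elements with exactly two children and no attributes, so speeches that carry extra child elements (stage directions, say) would be skipped.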

I made a WordCloud of those lines, only to discover that we lack Elizabethan stopwords:

WordCloud[DeleteStopwords@ToLowerCase@TextWords[StringRiffle[Flatten[proslines//.XMLElement[_,_,content_]:>content]]]]

First Prospero WordCloud

It's much more satisfying after a minor tweak:

WordCloud[DeleteCases[DeleteStopwords@ToLowerCase@TextWords[StringRiffle[Flatten[
    proslines //. XMLElement[_, _, content_] :> content]]], "thee" | "thou" | "thy"]]

Second Prospero WordCloud

Why limit ourselves to one character, though?

linesbychar = First@First@# -> Last /@ # & /@ GatherBy[Cases[tempestxml, 
    XMLElement["sp", {}, {XMLElement["speaker", _, {char_}], line_}] :> char -> line, Infinity], First]; 

Grid[Partition[Column@{First@#, WordCloud[DeleteCases[DeleteStopwords@ToLowerCase@TextWords[
               StringRiffle[Flatten[Last@# //. XMLElement[_, _, content_] :> content]]], 
                    "thee" | "thou" | "thy"]]} & /@ linesbychar, 6], Frame -> All, Alignment -> Left]

WordClouds for all characters
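The GatherBy construction above can also be written with GroupBy, which returns an Association keyed by speaker. A minimal sketch on made-up speaker -> line rules (the data are placeholders, not taken from the TEI file):

```wolfram
(* Toy speaker -> line rules, in the shape produced by the Cases call above *)
pairs = {"Pros." -> "line a", "Mir." -> "line b", "Pros." -> "line c"};

(* Group by the left-hand side and keep only the right-hand sides *)
bychar = GroupBy[pairs, First -> Last]
(* <|"Pros." -> {"line a", "line c"}, "Mir." -> {"line b"}|> *)

(* Normal turns the Association back into a list of rules, like linesbychar *)
Normal[bychar]
```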

We don't have to limit ourselves to characters; we can also make a WordCloud for each scene. In this document, each scene is contained in a <div>:

scenes = Cases[tempestxml, XMLElement["div", _, div_] :> div, Infinity];
Grid[Partition[Column[{First@#, WordCloud[DeleteCases[DeleteStopwords@ToLowerCase@TextWords[
           StringRiffle[Flatten[Cases[Last@#, XMLElement["ab", _, line_] :> line, Infinity] //. XMLElement[_, _, content_] :> content]]],
             "thee" | "thou" | "thy"]]}] & /@ ((Replace[Flatten[Cases[#, 
                XMLElement["head", _, h_] :> (h //. XMLElement[_, _, content_] :> content)]], {s_String} :> s] -> Rest@#) & /@ scenes), 5],                 
                    Frame -> All, Alignment -> Left]

a WordCloud for each scene

POSTED BY: Aaron Enright

Dear @Diego Zviovich and @Vitaliy Kaurov,

I did not have much time last night, so I only did some quite basic things. Here are some more ideas. To put everything into a historical context, we might want to look at important events in Shakespeare's life. There is a website (actually there are zillions of them) which has the data in an easy-to-read form:

TimelinePlot[
 Association[{#[[2]] -> Interpreter["Date"][#[[1]]]} & /@ (StringSplit[#, "   "] & /@ 
 StringSplit[StringSplit[StringSplit[Import["http://www.shmoop.com/william-shakespeare/timeline.html", "Plaintext"], "How It All Went Down"][[2]], "BACK NEXT"][[1]], "\n"][[2 ;; ;; 3]])]] 

enter image description here

On the website there are little snippets of text that explain what happened. It is certainly possible to display them using Tooltip in this TimelinePlot. I also wondered where all the plays of Shakespeare were set. Another website contains the information. I use Interpreter to get the GeoCoordinates. It does not always appear to work. Some dots are in Australia and the US; here I restrict the plot to Europe and the Middle East.

places = Import["http://www.nosweatshakespeare.com/shakespeares-plays/shakespeares-play-locations/", "Data"][[2 ;;, 1, 1, 1, 2]][[1, All, -1]];
gpscoords = Interpreter["Location"][places];
GeoListPlot[Select[gpscoords, Head[#] === GeoPosition &], GeoRange -> GeoBoundingBox[{GeoPosition[{59.64927428005451, \
-22.259507086895418`}], GeoPosition[{26.793037464663843`, 48.84842129323249}]}], GeoBackground -> "ReliefMap", GeoProjection -> "Mercator", ImageSize -> Large]

enter image description here

Syllables and Meter

OK. The next bits are a bit more complicated. I first wondered how many syllables each verse of the sonnets has. Luckily the Wolfram Language has a function for that. But first I set everything up as above and look at the first sonnet.

shakespeare = Import["http://www.gutenberg.org/cache/epub/100/pg100.txt"];
titles = (StringSplit[#, "\n"] & /@ StringTake[StringSplit[shakespeare, "by William Shakespeare"][[1 ;;]], -45])[[;; -3, -1]];
textssplit = (StringSplit[shakespeare, "by William Shakespeare"][[2 ;; -2]]);
alltexts = 
  Table[{titles[[i]], 
    StringDelete[textssplit[[i]], 
     "<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
          SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND 
          IS PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
          WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE
          DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
          PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
          COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
          SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>"]}, {i, 1, Length[titles]}];
allsonnets = StringSplit[StringSplit[alltexts[[1, 2]], "THE END"][[1]], Reverse@(ToString /@ Range[154])][[2 ;;]];
allsonnets[[1]]

enter image description here

The Wolfram Language has WordData built in, so I tried its "Hyphenation" property to count the syllables, here for sonnet number 4.

WordData[ToLowerCase[#], "Hyphenation"] & /@ # & /@ (TextWords[#] & /@DeleteCases[StringDelete[StringSplit[allsonnets[[4]], "\n"], ","],""][[1 ;; -2]])

enter image description here

The problem is that too many words cannot be hyphenated automatically. It turns out that Wolfram|Alpha does a better job, as discussed here. From there I also take the following function syllables, which submits a query to Wolfram|Alpha:

ClearAll@syllables;
SetAttributes[syllables, Listable];
syllables[word_String] := Length@WolframAlpha["syllables " <> word, {{"Hyphenation:WordData", 1}, "ComputableData"}]

With that function we can analyse the verses of sonnet number 4:

Monitor[sonnet4 = 
  Table[{#, syllables[#]} & /@ (TextWords[#] & /@ 
       DeleteCases[StringDelete[StringSplit[allsonnets[[4]], "\n"], ","], ""][[1 ;; -2]])[[k]], {k, 1, Length[DeleteCases[
       StringDelete[StringSplit[allsonnets[[4]], "\n"], ","], ""][[1 ;; -2]]]}], k]

This gives:

TableForm [Reverse /@ # & /@ sonnet4]

enter image description here

where the numbers above the words indicate the estimated number of syllables. Note that this number sometimes differs from the number of syllables you perceive when speaking. Also, some words were probably pronounced quite differently in Shakespeare's time. Here is the number of syllables per verse:

Total /@ sonnet4[[All, All, 2]]
(*{8, 10, 10, 10, 11, 11, 8, 10, 10, 10, 10, 10, 10, 10}*)

Let's do that for one more sonnet:

Monitor[sonnet5 = 
  Table[{#, syllables[#]} & /@ (TextWords[#] & /@ 
       DeleteCases[StringDelete[StringSplit[allsonnets[[5]], "\n"], ","], ""][[1 ;; -2]])[[k]], {k, 1, 
    Length[DeleteCases[StringDelete[StringSplit[allsonnets[[5]], "\n"], ","], ""][[1 ;; -2]]]}], k]

This gives:

TableForm [Reverse /@ # & /@ sonnet5]

enter image description here

There are obviously some problems, such as "o'er-snowed", but overall I am quite impressed with this result. Here is the syllable count per verse:

Total /@ sonnet5[[All, All, 2]]
(*{9, 10, 9, 10, 8, 10, 10, 9, 10, 11, 10, 10, 10, 10}*)

We can now plot and compare the counts for the two sonnets.

ListLinePlot[{Total /@ sonnet4[[All, All, 2]], 
  Total /@ sonnet5[[All, All, 2]]}, PlotRange -> {All, {0, 12}}, LabelStyle -> Directive[Bold, Medium], AxesLabel -> {"verse #", "syllables"}]

enter image description here

In fact, all but three of Shakespeare's sonnets conform to iambic pentameter, which is described here. On that website they say:

Shakespeare's sonnets are written predominantly in a meter called iambic pentameter, a rhyme scheme in which each sonnet line consists of ten syllables. The syllables are divided into five pairs called iambs or iambic feet. An iamb is a metrical unit made up of one unstressed syllable followed by one stressed syllable.

That is not exactly what we get, but we are close.
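To quantify "close", here is a quick sketch that takes the two lists of per-verse totals printed above and reports which verses deviate from ten syllables. The helper name offbeat is mine, and the counts are only as good as the Wolfram|Alpha estimates they came from.

```wolfram
(* Per-verse syllable totals copied from the outputs above *)
counts4 = {8, 10, 10, 10, 11, 11, 8, 10, 10, 10, 10, 10, 10, 10};
counts5 = {9, 10, 9, 10, 8, 10, 10, 9, 10, 11, 10, 10, 10, 10};

(* Indices of verses whose estimated count is not the ten syllables of a pentameter line *)
offbeat[counts_List] := Flatten[Position[counts, Except[10, _Integer]]]

offbeat[counts4]  (* {1, 5, 6, 7} *)
offbeat[counts5]  (* {1, 3, 5, 8, 10} *)

(* Average estimated syllables per verse for each sonnet *)
means = N[Mean /@ {counts4, counts5}]
```

So 10 of the 14 verses in sonnet 4 and 9 of the 14 in sonnet 5 come out at exactly ten estimated syllables.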

Rhyme and Meter

We can also try to figure out which verse rhymes with which. To do this we take the last word of every verse (first for sonnet 4):

(TextWords[#] & /@ DeleteCases[StringDelete[StringSplit[allsonnets[[4]], "\n"], ","], ""][[1 ;; -2]])[[All, -1]]

This gives:

{"spend", "legacy", "lend", "free", "abuse", "give", "use", "live", "alone", "deceive", "gone", "leave", "thee", "be"}

By looking at that list we can immediately see the rhyme scheme, but it would be nice to get this algorithmically. In fact, Wolfram|Alpha again has all we need.

pronounciation = WolframAlpha["IPA spend", {{"Pronunciation:WordData", 1}, "Plaintext"}]

enter image description here

The IPA gives the phonetic transcription, which is what I want. After a little bit of cleaning this is what I get for the first 10 sonnets:

Monitor[rhymingQ = 
  Table[(Quiet[(StringSplit[WolframAlpha["IPA " <> #, {{"Pronunciation:WordData", 1}, "Plaintext"}], {"IPA: ", ")"}][[2]])] /. {"IPA: ",")"} -> Missing["NotAvailable"]) & /@ (TextWords[#] & /@ 
DeleteCases[StringDelete[StringSplit[allsonnets[[l]], "\n"], ","], ""][[1 ;; -2]])[[All, -1]], {l, 1, 10}], l]

This does not always work, but it is ok:

TableForm[# /. Missing["NotAvailable"] -> "NA" & /@ rhymingQ]

enter image description here

We should be able to work with that. I can now convert the last two symbols to their CharacterCode:

endingsounds = Take[ToCharacterCode[#], -2] & /@ (# /. Missing["NotAvailable"] -> "NA" & /@ rhymingQ)[[1]]
{{78, 65}, {97, 618}, {105, 115}, {605, 105}, {618, 122}, {601, 108}, {618, 122}, {601, 108}, {110, 116}, {78, 65}, {110, 116}, {78,65}, {712, 105}, {78, 65}}

Note that I have converted the Missing bits into "NA". I will want to ignore the "NA"s. Next, I look for identical endings:

Rule @@@ Flatten /@ 
  Select[Position[endingsounds, #] & /@ DeleteDuplicates[Select[endingsounds, # != {78, 65} &]], Length[#] == 2 &]
(*{5 -> 7, 6 -> 8, 9 -> 11}*)

This indicates which verse rhymes with which. I have missed some of them because of the "NA"s, but I hope to make up for that by using several sonnets.
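To see why the phonetic transcription matters, here is an offline sketch that groups the same last words of sonnet 4 by their final two letters instead. This is a deliberately crude spelling-based stand-in (my assumption, not the method used above), and the names words, endings and groups are hypothetical.

```wolfram
(* Last words of the verses of sonnet 4, copied from the list above *)
words = {"spend", "legacy", "lend", "free", "abuse", "give", "use", "live",
   "alone", "deceive", "gone", "leave", "thee", "be"};

(* Crude proxy for the ending sound: the last two letters *)
endings = StringTake[#, -2] & /@ words;

(* Verse indices that share an ending, keeping only groups of two or more *)
groups = Select[Values[PositionIndex[endings]], Length[#] >= 2 &]
(* {{1, 3}, {4, 13}, {5, 7}, {6, 8, 10, 12}, {9, 11}} *)
```

Spelling lumps give/live together with deceive/leave, pairs free with thee while missing be, and never links legacy to free, which is the kind of error the IPA endings are meant to avoid.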

alllinks = {}; Do[endingsounds = 
  Take[ToCharacterCode[#], -2] & /@ (# /. Missing["NotAvailable"] -> "NA" & /@ rhymingQ)[[i]]; 
 AppendTo[alllinks, Rule @@@ Flatten /@ Select[Position[endingsounds, #] & /@ 
 DeleteDuplicates[Select[endingsounds, # != {78, 65} &]], Length[#] == 2 &]], {i, 1, 10}]; alllinks = Flatten[alllinks]
{5 -> 7, 6 -> 8, 9 -> 11, 9 -> 11, 10 -> 12, 5 -> 7, 11 -> 13, 1 -> 3, 4 -> 14, 5 -> 7, 6 -> 8, 10 -> 12, 1 -> 3, 9 -> 11, 13 -> 14, 
 2 -> 4, 13 -> 14, 1 -> 3, 5 -> 7, 6 -> 8, 9 -> 11, 10 -> 12, 1 -> 3, 2 -> 4, 5 -> 7, 10 -> 12, 13 -> 14, 1 -> 3, 2 -> 4, 5 -> 7, 10 -> 12,
  13 -> 14}

Not elegant at all, and I see @Vitaliy Kaurov 's despair at these lines of code ... Anyway, it gives a nice graph.

Graph[alllinks, VertexLabels -> "Name", Background -> Black, EdgeStyle -> Yellow, VertexLabelStyle -> Directive[Red, 15]]

enter image description here

This illustrates the structure of the sonnets. Nodes that are usually linked, like 1 and 3, 2 and 4, and 13 and 14, tend to rhyme. Note that occasionally there are additional links, like from 11 to 13 and from 4 to 14.
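One way to separate the usual links from the occasional ones is to count how often each pair occurs across the ten sonnets. A minimal sketch, with alllinks copied from the output above and an arbitrary threshold of two occurrences (the threshold and the name robust are my assumptions):

```wolfram
(* Rhyme links for the first ten sonnets, copied from the output above *)
alllinks = {5 -> 7, 6 -> 8, 9 -> 11, 9 -> 11, 10 -> 12, 5 -> 7, 11 -> 13,
   1 -> 3, 4 -> 14, 5 -> 7, 6 -> 8, 10 -> 12, 1 -> 3, 9 -> 11, 13 -> 14,
   2 -> 4, 13 -> 14, 1 -> 3, 5 -> 7, 6 -> 8, 9 -> 11, 10 -> 12, 1 -> 3,
   2 -> 4, 5 -> 7, 10 -> 12, 13 -> 14, 1 -> 3, 2 -> 4, 5 -> 7, 10 -> 12,
   13 -> 14};

(* How often each link was detected *)
tally = Tally[alllinks];

(* Keep only links seen at least twice *)
robust = Cases[tally, {rule_, n_ /; n >= 2} :> rule];

(* Graph of the recurring links only *)
Graph[robust, VertexLabels -> "Name"]
```

On this data the two links that occur only once, 11 -> 13 and 4 -> 14, are dropped, and the seven remaining pairs are 1-3, 2-4, 5-7, 6-8, 9-11, 10-12 and 13-14.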

On the website referenced above they also say:

There are fourteen lines in a Shakespearean sonnet. The first twelve lines are divided into three quatrains with four lines each. In the three quatrains the poet establishes a theme or problem and then resolves it in the final two lines, called the couplet. The rhyme scheme of the quatrains is abab cdcd efef. The couplet has the rhyme scheme gg.

Our network essentially recovers that structure. This rhyme structure distinguishes Shakespeare's style from, for example, Petrarch's style. I have some interest in Petrarch's poems, so I might post a comparison later.

I also haven't really looked at the meter; I have only tried to count syllables. But the phonetic transcription does contain information about stress. By combining the two bits of information we might be able to deduce the meter.

Cheers,

M.

POSTED BY: Marco Thiel
Posted 10 years ago
POSTED BY: Diego Zviovich
POSTED BY: Marco Thiel