WOLFRAM COMMUNITY

Analytics of Republican Debate and network percolation

Vitaliy Kaurov, WOLFRAM Research

Posted 10 years ago

Alan Joyce sent me some neat code analyzing the Republican debate of Sep. 16, 2015. Please do see his analytics below. Transcripts of the debate can be found online. Alan mined the most popular words used by the candidates, filtered and re-weighted by different criteria. Properly weighted WordClouds are a good way to grasp key topics.

I just wanted to point to a graphs-and-networks take on the data. I thought that some candidates may share some of the top words they use. So if the candidates are nodes, then a weighted edge between two of them reflects how many top words they share. If you consider 1 top word per candidate, the graph will be completely disconnected, as each candidate has their own unique single top word. As you increase the pool of top words, some of them will be shared between candidates and links between nodes will appear. Percolation is the moment when, driven by the pool size, all candidates become connected. In the opposite limit of a large pool size, every candidate is linked to every other and we get a complete graph. Below is the percolation moment, which happens at 5 top words per candidate. It is indicative of which candidates speak about the top common subjects.

CommunityGraphPlot[
 HighlightGraph[SetProperty[g, EdgeLabels -> None], 
  Table[Style[e, Opacity[.7], Thickness[.005 PropertyValue[{g, e}, EdgeWeight]]], {e, EdgeList[g]}]], 
 CommunityBoundaryStyle -> Directive[Red, Dashed, Thick], 
 CommunityRegionStyle -> {Directive[Opacity[.1], Red], Directive[Opacity[.1], Yellow], 
   Directive[Opacity[.1], Blue]}]

The edge thickness reflects the number of common words. Grouping shows clustering of candidates around common words. Vertex size comes from DegreeCentrality, which gives high centralities to vertices that have high vertex degrees. So candidates whose top words are similar to those of more other candidates have larger vertices.

The clustered CommunityGraphPlot was derived from the top words:

topWords = Sort[Normal[highFrequencyForCloud[#]], #1[[2]] > #2[[2]] &][[;; 5]][[All, 1]] & /@ candidates;
TableForm[topWords, TableHeadings -> {candidates, None}]

(* refining text filters would narrow top words more precisely *)

and WeightedAdjacencyGraph:

mocw = Outer[Length[Intersection[#1, #2]] &, topWords, topWords, 1] (1 - IdentityMatrix[10]) /. 0 -> Infinity;
mocw // MatrixForm

g = WeightedAdjacencyGraph[candidates, mocw, VertexLabels -> "Name", EdgeLabels -> "EdgeWeight", 
   EdgeLabelStyle -> 15, VertexLabelStyle -> 14, VertexSize -> "DegreeCentrality", 
   GraphStyle -> "ThickEdge", GraphLayout -> "CircularEmbedding", 
   VertexStyle -> Directive[Opacity[.8], Orange]]

For top words extraction and better refinement see Alan's analysis right below. The notebook is attached to his post.
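The percolation idea described above can be tried on a small scale. What follows is only a minimal sketch on made-up data, not the code used for the plot: the association `ranked`, the helper `sharedGraph` and the name `percolationSize` are all hypothetical. It looks for the smallest pool size at which the shared-word graph becomes connected.

```wolfram
(* Hypothetical toy data: each speaker's words, ranked from most to least frequent *)
ranked = <|
   "A" -> {"tax", "jobs", "border", "iran"},
   "B" -> {"jobs", "tax", "health", "iran"},
   "C" -> {"iran", "border", "jobs", "tax"}|>;

(* Graph with an edge between two speakers when their top-n word pools overlap *)
sharedGraph[n_Integer] := Module[{names = Keys[ranked], top, pairs},
   top = Take[#, n] & /@ ranked;
   pairs = Select[Subsets[names, {2}],
     IntersectingQ[top[#[[1]]], top[#[[2]]]] &];
   Graph[names, UndirectedEdge @@@ pairs]];

(* Smallest pool size for which every speaker is reachable from every other *)
percolationSize = SelectFirst[Range[4], ConnectedGraphQ[sharedGraph[#]] &]
```

With real data, `ranked` would be replaced by each candidate's words sorted by weight, and the range would run up to the longest word list.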

POSTED BY: Vitaliy Kaurov

EDITORIAL BOARD, WOLFRAM

Posted 9 years ago

Another post of yours has been selected for the Staff Picks group, congratulations! We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!

POSTED BY: EDITORIAL BOARD

Vitaliy Kaurov, WOLFRAM Research

Posted 10 years ago

"connecting them by the times their daily hits overlap"

Sort of, yes, but "overlap" is too broad a term. The measure used is correlation, which is exactly the name of the function in the main block of code:

m = Outer[Correlation, #, #, 1] &@ 
QuantityMagnitude[Normal[recent][[All, All, 2]]] (1 - IdentityMatrix[Length[fullnames]]) /. 0. -> Infinity;

POSTED BY: Vitaliy Kaurov

Jonathan Wallace, Wolfram

Posted 10 years ago

So it's essentially a visualization of the traffic spikes to each candidate's wiki page, connecting candidates by the times their daily hits overlap?

POSTED BY: Jonathan Wallace

Vitaliy Kaurov, WOLFRAM Research

Posted 10 years ago

Communities just group those candidates whose pages are viewed by the public more synchronously. The wiki data are in hits per day, based on weekly averages of daily hits to the English-language page. That explanation can be seen on any W|A page under the wiki-data plot, for example: Donald Trump

POSTED BY: Vitaliy Kaurov

Jonathan Wallace, Wolfram

Posted 10 years ago

Is the wiki data hits per page or what exactly is that data? I'm unsure what the "communities" are...wiki page queries?

POSTED BY: Jonathan Wallace

Vitaliy Kaurov, WOLFRAM Research

Posted 10 years ago

Very interesting, Marco! The Wiki data can actually be used to reflect on what candidates people view as related. Again your data:

fullnames = {{"Donald", "TRUMP"}, {"Jeb", "BUSH"}, {"Scott", "WALKER"}, {"Marco", "RUBIO"}, 
                     {"Chris", "CHRISTIE"}, {"Ben", "CARSON"}, {"Rand", "PAUL"}, {"Ted", "CRUZ"}, 
                     {"Mike", "HUCKABEE"}, {"John", "KASICH"}, {"Carly", "FIORINA"}};

data = ParallelMap[WolframAlpha[#[[1]] <> " " <> #[[2]], 
{{"PopularityPod:WikipediaStatsData", 1}, "ComputableData"}] &, fullnames];

But I'll take only the data since the start of 2014, to be fair to the recent campaign, and use a log plot to see the details better:

recent = TimeSeriesWindow[#, {{2014, 1, 1}, Now}] &@ TemporalData[data];
DateListLogPlot[recent, PlotRange -> All, PlotTheme -> "Detailed", AspectRatio -> 1/4, 
 ImageSize -> 800, PlotLegends -> fullnames[[All, 2]]]

Let's get the mutual correlation matrix. Note the Infinity trick on the diagonal, which removes self-edges in WeightedAdjacencyGraph.

m = Outer[Correlation, #, #, 1] &@ 
QuantityMagnitude[Normal[recent][[All, All, 2]]] (1 - IdentityMatrix[Length[fullnames]]) /. 0. -> Infinity;
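For readers who want to see what the diagonal trick does, here is a minimal sketch on made-up numbers. The three short lists in `series` are hypothetical stand-ins for the page-hit series, and the names `corr`, `mTrick`, `mByPosition` and `gToy` are not from the post. It also shows a position-based variant, which is an assumption about what one might prefer when an off-diagonal correlation could come out as exactly 0.

```wolfram
(* Hypothetical toy data: three short series standing in for the page-hit series *)
series = {{1., 2., 3., 4.}, {2., 4., 5., 9.}, {1., 3., 2., 5.}};

(* Pairwise correlations; each series correlates perfectly with itself,
   so the diagonal holds ones *)
corr = Outer[Correlation, series, series, 1];

(* The trick from the post: zero out the diagonal, then turn zeros into Infinity,
   which WeightedAdjacencyGraph reads as "no edge" *)
mTrick = corr (1 - IdentityMatrix[Length[series]]) /. 0. -> Infinity;

(* Variant: address the diagonal by position, so that an off-diagonal 0.
   would be left alone *)
mByPosition = ReplacePart[corr, {i_, i_} -> Infinity];

(* Without self-loops, three vertices give a complete graph with three edges *)
gToy = WeightedAdjacencyGraph[mTrick];
{VertexCount[gToy], EdgeCount[gToy]}
```

The value-based replacement is shorter; the position-based one does not depend on the values in the matrix.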

Significant negative correlations are hard to get in such data, but positive values can be quite high:

m // Flatten // Sort

MatrixPlot[m, FrameTicks -> {Transpose[{Range[11], #}], Transpose[{Range[11], Rotate[#, Pi/2] & /@ #}]}, 
   ColorFunction -> "Rainbow"] &@fullnames[[All, 2]]

So we are getting a complete weighted graph:

g = WeightedAdjacencyGraph[m, VertexLabels -> Thread[Range[11] -> fullnames[[All, 2]]], 
   VertexSize -> "ClosenessCentrality", VertexStyle -> Opacity[.5]];

FindGraphCommunities still reacts to EdgeWeight:

comm = FindGraphCommunities[g]

{{1, 5, 6, 9, 10, 11}, {2, 3, 4, 7, 8}}
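The communities come back as vertex indices. A small sketch for reading them as names, assuming (as the VertexLabels option above suggests) that vertex i corresponds to the i-th entry of the candidate list; `surnames` and `namedComm` are names introduced here only for illustration, and `comm` is copied from the output shown above:

```wolfram
(* Surnames in the same order as fullnames in the post *)
surnames = {"TRUMP", "BUSH", "WALKER", "RUBIO", "CHRISTIE", "CARSON",
   "PAUL", "CRUZ", "HUCKABEE", "KASICH", "FIORINA"};

(* Community output copied from the post *)
comm = {{1, 5, 6, 9, 10, 11}, {2, 3, 4, 7, 8}};

(* Replace each vertex index by the surname at that position *)
namedComm = Map[surnames[[#]] &, comm]
```

Since the Wikipedia statistics change over time, rerunning the analysis today may well give different communities.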

So I wonder if anyone with actual knowledge of politics can see some truth in this clustering:

CommunityGraphPlot[g, comm]

POSTED BY: Vitaliy Kaurov

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 10 years ago

Hi,

there are two more little things to add. Vitaliy has this fantastic post on measuring interest in the conflicts in Syria and Ukraine. We can of course use the same technique to study people's interest in the presidential candidates:

fullnames = {{"Donald", "TRUMP"}, {"Jeb", "BUSH"}, {"Scott", "WALKER"}, {"Marco", "RUBIO"}, 
   {"Chris", "CHRISTIE"}, {"Ben", "CARSON"}, {"Rand", "PAUL"}, {"Ted", "CRUZ"}, 
   {"Mike", "HUCKABEE"}, {"John", "KASICH"}, {"Carly", "FIORINA"}};

data = WolframAlpha[#[[1]] <> " " <> #[[2]], 
     {{"PopularityPod:WikipediaStatsData", 1}, "ComputableData"}] & /@ fullnames;

DateListPlot[data, PlotRange -> All, PlotTheme -> "Detailed", AspectRatio -> 1/4, 
 ImageSize -> 800, PlotLegends -> fullnames[[All, 2]]]

It would now be interesting to identify what the peaks mean. Some are more obvious than others, but I do not have a neat, automated way to identify the events that cause these peaks. Vitaliy, I think that in your post you identified the peaks "manually". There are websites, like Wikipedia, that list important events for most days, but the data does not appear to suffice to identify peaks at this level of detail automatically. Do you have any idea how to automate that?

Another thing is that we could draw an angle path from the sentiment list:

candidates = {"TRUMP", "BUSH", "WALKER", "RUBIO", "CHRISTIE", "CARSON", "PAUL", "CRUZ", 
   "HUCKABEE", "KASICH", "FIORINA"};

sentimentlist = Table[-"Negative" + "Positive" /. 
    ((Classify["Sentiment", #, "Probabilities"] & /@ #) &@
      TextSentences@Part[debateBySpeaker[#] & /@ candidates, k]), {k, 1, Length[candidates]}];

ListLinePlot[AnglePath[#] & /@ sentimentlist, PlotLegends -> candidates, ImageSize -> Large]

This plot is (relatively) easy to interpret: if the sentences are positive the curve bends left, otherwise right.

Cheers,
Marco
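The left/right reading of the angle path can be checked on a toy list. This is only a sketch with hypothetical scores (`toyScores` and `path` are not from the post): AnglePath takes unit steps and turns by each value, interpreted as an angle in radians, so positive scores turn counterclockwise (left) and negative ones clockwise (right).

```wolfram
(* Hypothetical sentiment scores, one per sentence: three positive, then three negative *)
toyScores = {0.4, 0.4, 0.4, -0.4, -0.4, -0.4};

(* Starting at the origin, turn by each score (in radians) and take a unit step *)
path = AnglePath[toyScores];

(* The curve bends left for the first three steps and right for the last three *)
ListLinePlot[path, AspectRatio -> Automatic, Mesh -> All]
```

Because the turns accumulate, a long run of mildly positive sentences makes the curve spiral, which is worth keeping in mind when comparing candidates with very different numbers of sentences.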

POSTED BY: Marco Thiel

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 10 years ago

Hi everyone,

this is a really nice discussion. Together with the recent blog post, there is not much I can contribute, but I made a couple of observations that I would like to add anyway. I used the transcript in Alan Joyce's post above, so I will assume that we have his variable textRaw and then use exactly the same functions taken (stolen?) from his fantastic post:

debateBySpeaker = 
  StringJoin /@ GroupBy[StringSplit[#, ": ", 2] & /@ 
      StringTrim /@ StringSplit[StringDelete[
         StringReplace[StringReplace[StringDelete[textRaw, 
            "\n" | "(APPLAUSE)" | "(LAUGHTER)" | "(CROSSTALK)" | 
             "(COMMERCIAL BREAK)" | ("UNKNOWN") | "know" | "going" | 
             "think" | "people" | "say" | "said" | "country" | 
             "want" | "need"], "..." -> " "], 
          name : RegularExpression["[A-Z ]+:"] :> "\n" <> name], 
         RegularExpression["\\[[a-z ]+\\]"]], "\n"], First][[All, All, 2]];

candidates = {"TRUMP", "BUSH", "WALKER", "RUBIO", "CHRISTIE", "CARSON", "PAUL", "CRUZ", 
   "HUCKABEE", "KASICH", "FIORINA"};

Sentiments appear to be crucial in these debates and recent functionality of the Wolfram Language is really useful in this context. For a recent analysis I did on the sentiments in about 10000 books (which I can post if there is any interest), I developed a little function that uses the Wolfram Language's sentiment analysis feature. The Classify function also gives probabilities, which I use as weights to get better estimates. Here is the call for Mr. Trump's contribution:

sentimentlistTrump = -"Negative" + "Positive" /. 
   ((Classify["Sentiment", #, "Probabilities"] & /@ #) &@
     TextSentences@Part[debateBySpeaker[#] & /@ candidates, 1]);

It turns out that a little bit of averaging is in order. In books I use more sentences, but the candidates' contributions are rather short, so I will use a window of ten sentences.

MovingAverage[sentimentlistTrump, 10] // ListLinePlot

Calculating the sentiment lists for all candidates is straightforward now:

sentimentlist = Table[-"Negative" + "Positive" /. 
    ((Classify["Sentiment", #, "Probabilities"] & /@ #) &@
      TextSentences@Part[debateBySpeaker[#] & /@ candidates, k]), {k, 1, Length[candidates]}];

The number of sentences spoken depends very much on the candidate. Therefore, I "normalise" all contributions to length 1, or 100%. That looks like this:

ArrayReshape[
  ListLinePlot[
     Transpose@{Range[Length[#[[2]]]]/Length[#[[2]]], #[[2]]}, 
     PlotLabel -> #[[1]], Filling -> Axis, ImageSize -> Medium, 
     Epilog -> {Red, Line[{{0, Mean[#[[2]]]}, {1, Mean[#[[2]]]}}]}] & /@ 
   Transpose@{candidates, sentimentlist}, {6, 2}] // TableForm

The red line shows the "average sentiment"; it is interesting that it is negative for all but two candidates. Also, everybody starts on a positive note and many end positively. This calls for a little further analysis. I first generate a list of the lengths of all sentences spoken by each candidate. I will use this to plot a histogram of sentence lengths.

wordspersentence = ((Length /@ (TextWords /@ TextSentences[#]))) & /@ 
   (debateBySpeaker[#] & /@ candidates);

We can now generate a little dataset like so:

Dataset[<|"Candidate" -> #[[1]], "Mean" -> Mean[#[[2]]], "Variance" -> Variance[#[[2]]], 
    "sentences" -> Length[#[[2]]], 
    "Length sentences" -> Histogram[#[[3]], 100, PlotTheme -> "Marketing", 
      FrameLabel -> {"# words", "frequency"}, ImageSize -> 200, 
      PlotRange -> {{0, 100}, All}]|> & /@ 
  Transpose[{candidates, sentimentlist, wordspersentence}]]

The "Mean" column describes the mean sentiment; apart from candidates Cruz and Kasich, all are negative. The most negative is Mr Carson. The variance is calculated for the sentiments, too. Mr Trump's statements seem to have the largest variation of sentiments, i.e. he seems to display more extreme/emotional statements/sentiments than the other candidates. Mr Trump also says many more sentences than the others, and seems to be dominating the debate in terms of number of sentences.
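The scoring expression used above, -"Negative" + "Positive" replaced by the class probabilities, can be tried without running the classifier. The sketch below uses hand-made associations that are only assumed to have the same shape as the "Probabilities" output of Classify; `toyProbs`, `toyScores`, `smoothed` and `normalised` are hypothetical names.

```wolfram
(* Hypothetical class probabilities, one association per sentence *)
toyProbs = {
   <|"Negative" -> 0.1, "Neutral" -> 0.2, "Positive" -> 0.7|>,
   <|"Negative" -> 0.6, "Neutral" -> 0.3, "Positive" -> 0.1|>,
   <|"Negative" -> 0.2, "Neutral" -> 0.6, "Positive" -> 0.2|>,
   <|"Negative" -> 0.0, "Neutral" -> 0.1, "Positive" -> 0.9|>};

(* Score = P(positive) - P(negative); ReplaceAll with an association
   substitutes the two strings by their probabilities *)
toyScores = (-"Negative" + "Positive" /. #) & /@ toyProbs;

(* Smooth with a short window (the post uses 10 on the real data) *)
smoothed = MovingAverage[toyScores, 2];

(* Rescale the sentence index to the interval (0, 1] so that speakers
   with different numbers of sentences can be compared *)
normalised = Transpose[{Range[Length[toyScores]]/Length[toyScores], toyScores}];
```

A score near +1 means the sentence is confidently positive, near -1 confidently negative, and a largely neutral sentence lands near 0 whichever way it leans.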
I then thought that it might be a good idea to look at the WordClouds separately for all the positive and all the negative statements. What words would candidates use in positive and which ones in negative statements?

(* assumed definition, not shown in the original post: each candidate's list of sentences, aligned with sentimentlist *)
sentences = TextSentences /@ (debateBySpeaker[#] & /@ candidates);

Monitor[posnegCloud = Table[<|"Candidate" -> candidates[[k]], 
     "pos WordCloud" -> WordCloud[DeleteStopwords[Flatten[TextWords[
          Select[Transpose[{sentences[[k]], sentimentlist[[k]]}], #[[2]] > 0 &][[All, 1]]]]], 
       IgnoreCase -> True], 
     "neg WordCloud" -> WordCloud[DeleteStopwords[Flatten[TextWords[
          Select[Transpose[{sentences[[k]], sentimentlist[[k]]}], #[[2]] < 0 &][[All, 1]]]]], 
       IgnoreCase -> True]|>, {k, 1, Length[candidates]}], k]

I use Monitor to be updated on the progress of the calculation. When it's done, I create a dataset:

posnegdata = Dataset[posnegCloud]

It is much easier to see when you run it on your computer and can properly enlarge it. We can display entries like so:

posnegdata[2]

It is interesting to see that Mr Trump comes up very prominently in the negative sentences of Mrs Fiorina. "Donald" appears in negative sentences of Mr Bush. It looks like an interesting social network. Well, let's have a look at that. We first see which candidates use which other candidates' names.
adjacency1 = Table[Length[Select[sentences[[i]], 
     Evaluate[StringContainsQ[#, candidates[[j]], IgnoreCase -> True] &]]], 
   {j, 1, Length[candidates]}, {i, 1, Length[candidates]}];

We can then plot the corresponding graph:

interactions = AdjacencyGraph[
  Transpose@Table[Length[Select[sentences[[i]], 
       Evaluate[StringContainsQ[#, candidates[[j]], IgnoreCase -> True] &]]], 
     {j, 1, Length[candidates]}, {i, 1, Length[candidates]}], 
  VertexLabels -> Rule @@@ Transpose[{Range[11], candidates}]]

and calculate some "importance measures":

Grid[Join[{{"candidate", "BetweenCentral", "Pagerank"}}, 
  Reverse@SortBy[Transpose[{candidates, BetweennessCentrality[interactions], 
      PageRankCentrality[interactions, 0.1]}], #[[2]] &]], Frame -> All]

This shows that, interestingly, Mr Bush seems to be most central to the debate. Unfortunately, this last analysis is incorrect. I only checked for the surnames, but the word clouds told us that, for example, Mr Bush uses "Donald". Luckily, this is not difficult to fix. We first need the given names of the candidates:

givennames = {"Donald", "Jeb", "Scott", "Marco", "Chris", "Ben", "Rand", "Ted", "Mike", 
   "John", "Carly"};

to generate their full names:

fullnames = Transpose[{givennames, candidates}]

(* {{"Donald", "TRUMP"}, {"Jeb", "BUSH"}, {"Scott", "WALKER"}, {"Marco", "RUBIO"}, {"Chris", "CHRISTIE"}, {"Ben", "CARSON"}, {"Rand", "PAUL"}, {"Ted", "CRUZ"}, {"Mike", "HUCKABEE"}, {"John", "KASICH"}, {"Carly", "FIORINA"}} *)

Then as before:

interactionsfull = AdjacencyGraph[
  Transpose@Table[Length[Select[sentences[[i]], 
       Evaluate[StringContainsQ[#, fullnames[[j]], IgnoreCase -> True] &]]], 
     {j, 1, Length[candidates]}, {i, 1, Length[candidates]}], 
  VertexLabels -> Rule @@@ Transpose[{Range[11], candidates}]]

Interesting, now Mr Cruz appears to be quite central. Our graph measures look like this:

Now, both Mr Bush and Mr Trump have dropped substantially in relevance in the network. But something is odd here.
Let's check again. This is how often candidates are referred to by their surname:

Transpose[{candidates, Total /@ Table[Length[Select[sentences[[i]], 
        Evaluate[StringContainsQ[#, candidates[[j]], IgnoreCase -> True] &]]], 
      {j, 1, Length[candidates]}, {i, 1, Length[candidates]}]}] // TableForm

And this is how often they are referred to by their given name:

Transpose[{candidates, Total /@ Table[Length[Select[sentences[[i]], 
        Evaluate[StringContainsQ[#, givennames[[j]], IgnoreCase -> True] &]]], 
      {j, 1, Length[candidates]}, {i, 1, Length[candidates]}]}] // TableForm

OK, now we see that there is clearly a mistake. Why would Mr Cruz be so incredibly popular and only go by his first name? Let's check some of the sentences that contain his name:

Select[Flatten[sentences], StringContainsQ[#, "Ted", IgnoreCase -> True] &]

Right, so I stupidly told my program to look out for "ted" as part of words like "committed". Well, that's too bad, but it can be fixed.

givennames = {" Donald", "Jeb", "Scott", "Marco", "Chris", "Ben", "Rand", " Ted ", "Mike", 
   "John", "Carly"};
fullnames = Transpose[{givennames, candidates}];

(so I put a space before and after Ted) and then again

interactionsfull = AdjacencyGraph[
  Transpose@Table[Length[Select[sentences[[i]], 
       Evaluate[StringContainsQ[#, fullnames[[j]], IgnoreCase -> True] &]]], 
     {j, 1, Length[candidates]}, {i, 1, Length[candidates]}], 
  VertexLabels -> Rule @@@ Transpose[{Range[11], candidates}]]

Now again the importance of the candidates in the discussion:

Grid[Join[{{"candidate", "BetweenCentral", "Pagerank"}}, 
  Reverse@SortBy[Transpose[{candidates, BetweennessCentrality[interactionsfull], 
      PageRankCentrality[interactionsfull, 0.1]}], #[[2]] &]], Frame -> All]

and Mr Bush is first again. Note that in terms of self-references Mr Trump wins.
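An alternative to padding names with spaces is to require word boundaries in the string pattern. This is a sketch of that idea rather than what the post does: `toySentences` and the helper `mentionsQ` are hypothetical, while WordBoundary, Alternatives and the IgnoreCase option are standard Wolfram Language string-pattern elements.

```wolfram
(* Hypothetical example sentences *)
toySentences = {
   "I am committed to this.",
   "Ted is right about that.",
   "I agree with Senator Cruz."};

(* True when any of the names occurs as a whole word, ignoring case *)
mentionsQ[sentence_String, names_List] :=
  StringContainsQ[sentence,
   WordBoundary ~~ (Alternatives @@ names) ~~ WordBoundary,
   IgnoreCase -> True];

(* "committed" no longer counts as a mention of "Ted" *)
mentionsQ[#, {"Ted", "CRUZ"}] & /@ toySentences
```

Unlike the space-padded " Ted ", this also catches a name at the start of a sentence or directly before punctuation. It does not help with common given names such as "John" or "Mike", which may refer to people who are not on the stage.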
A community graph of this who-references-whom network is also interesting:

CommunityGraphPlot[
 Transpose@Table[Length[Select[sentences[[i]], 
      Evaluate[StringContainsQ[#, fullnames[[j]], IgnoreCase -> True] &]]], 
    {j, 1, Length[candidates]}, {i, 1, Length[candidates]}], 
 VertexLabels -> Rule @@@ Transpose[{Range[11], candidates}]]

It is interesting that the reference network is quite different from Vitaliy's same-topic graph. I think that the reference graph might be useful to understand who considers whom a direct contender: in these debates you tend to address people you have a difference of opinion with rather than people you agree with, because you need to show why people should vote for you and why you are different. It is interesting that, for example, Cruz, Huckabee and Kasich are a clique in my graph but are all in different cliques in Vitaliy's topic graph.

There might still be other glitches in here, and sentiment analysis is a very delicate issue anyway, so this comes with a health warning. But I hope that someone with more insight than me can make something out of this.

Cheers,
Marco

POSTED BY: Marco Thiel

Alan Joyce, Wolfram|Alpha

Posted 10 years ago

Right, we should fix that. In the meantime, it's kind of interesting to look more closely at the words I threw out of the earlier clouds, and the context in which they appear. For example, "we need" is such a common phrase in these debates, but what is it that each candidate thinks "we" need?

POSTED BY: Alan Joyce

Jonathan Wallace, Wolfram

Posted 10 years ago

What if instead of showing what the candidates said, we show what people heard? I wonder if there's a way to pull Twitter data by #demdebate or #gopdebate for a word cloud of reactions?

POSTED BY: Jonathan Wallace

Christopher Carlson, Wolfram Research

Posted 10 years ago

Oh, darn. I thought I was on to something. It's misleading, then, that the posts are not constructing the word clouds in the same way. The first thing people are going to do is compare the Democratic and Republican clouds, and draw wrong conclusions from the comparison.

POSTED BY: Christopher Carlson

Alan Joyce, Wolfram|Alpha

Posted 10 years ago

Not so interesting. Check the earlier notebooks: I made a point of removing "people" and a handful of other words that were exceptionally common (across all candidates) in the context of the debates. The Democratic clouds would showcase more significant differences between the candidates if they did the same thing.

POSTED BY: Alan Joyce

Christopher Carlson, Wolfram Research

Posted 10 years ago

Very interesting that "people" is prominent in 4/5 of the Democratic word clouds in this post http://blog.wolfram.com/2015/10/14/democratic-presidential-debate-word-clouds/ and in none of the Republican ones in this post http://blog.wolfram.com/2015/08/13/the-winner-of-the-gop-presidential-debate/

POSTED BY: Christopher Carlson

Drew Lesso, Almost Music

Posted 10 years ago

This is just so much fun and informative; a timely use of current analytics.

POSTED BY: Drew Lesso

Alan Joyce, Wolfram|Alpha

Posted 10 years ago

This uses a preliminary transcript of the September 16 debate, with some manual editing to make it easier to process; the edited text is included in the attached notebook, assigned to the variable textRaw. Feel free to try out the simple public app, or do some additional experimentation on the text. It'll be interesting to start watching trends over time, as the campaign season progresses and more debates occur.

Manipulate[
 BarChart[ReplaceAll[<|# -> allCandidateCounts[#][ToLowerCase[word]] & /@ 
     Keys[allCandidateCounts]|>, _Missing -> 0], BarOrigin -> Left, 
  ChartLabels -> Automatic, 
  PlotLabel -> "Word frequency in the Sept. 16, 2015 Republican Debate"], {word, 
  "freedom", InputField[#, String] &}]

CloudDeploy[%, Permissions -> "Public"]

===> CloudObject["https://www.wolframcloud.com/objects/d1b62bc5-f686-42b3-bab9-bb70436d7e02"]
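The app above turns missing words into zeros with ReplaceAll on _Missing. A possible alternative, shown here only as a sketch on made-up counts (`lookupCounts` and `freq` are hypothetical names), is to supply the default directly through Lookup:

```wolfram
(* Hypothetical counts, shaped like allCandidateCounts: speaker -> (word -> count) *)
lookupCounts = <|
   "A" -> <|"freedom" -> 3, "tax" -> 5|>,
   "B" -> <|"tax" -> 2|>|>;

(* Per-speaker frequency of a word, with 0 for speakers who never used it *)
freq[word_String] := Lookup[#, ToLowerCase[word], 0] & /@ lookupCounts;

freq["Freedom"]
```

The result is again an association keyed by speaker, so it can be passed to BarChart with ChartLabels -> Automatic in the same way.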

Basic counts

debateBySpeaker = 
  StringJoin /@ GroupBy[StringSplit[#, ": ", 2] & /@ 
      StringTrim /@ StringSplit[StringDelete[
         StringReplace[StringReplace[StringDelete[textRaw, 
            "\n" | "(APPLAUSE)" | "(LAUGHTER)" | "(CROSSTALK)" | 
             "(COMMERCIAL BREAK)" | ("UNKNOWN") | "know" | "going" | 
             "think" | "people" | "say" | "said" | "country" | 
             "want" | "need"], "..." -> " "], 
          name : RegularExpression["[A-Z ]+:"] :> "\n" <> name], 
         RegularExpression["\\[[a-z ]+\\]"]], "\n"], First][[All, All, 2]];

candidates = {"TRUMP", "BUSH", "WALKER", "RUBIO", "CHRISTIE", 
   "CARSON", "PAUL", "CRUZ", "KASICH", "FIORINA"};

Multicolumn[Labeled[Framed@
     WordCloud[DeleteStopwords@debateBySpeaker[#], IgnoreCase -> True,
       ImageSize -> 300], Style[#, "Section"], Top] & /@ candidates, 3]

Only words with higher than overall mean frequency

allCandidateCounts = <|# -> 
      Sort[Counts[
        DeleteStopwords[
         TextWords[
          ToLowerCase@
           StringReplace[debateBySpeaker[#], "." -> " "]]]]] & /@ 
    candidates|>;

meanCounts = Merge[Values[allCandidateCounts], N[Mean[#]] &];

candidateVsMean = <|# -> 
      With[{cand = allCandidateCounts[#]}, <|# -> {cand[#], meanCounts[#]} & /@ 
         Keys[cand]|>] & /@ candidates|>;

highFrequencyPerCandidate = 
  Select[#, #[[1]] > #[[2]] &] & /@ candidateVsMean;

highFrequencyForCloud = <|# -> 
      highFrequencyPerCandidate[#][[All, 1]] & /@ candidates|>;

Multicolumn[
 Labeled[Framed@WordCloud[highFrequencyForCloud[#], IgnoreCase -> True, 
      ImageSize -> 300], Style[#, "Section"], Top] & /@ candidates, 3]
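One detail of meanCounts that may matter when reading these clouds: Merge averages a word only over the speakers who actually used it, so a word used by a single candidate has a mean equal to that candidate's own count. The sketch below illustrates this on made-up counts and, as an assumed alternative, computes a mean that treats absent words as zero; `mergeCounts`, `meanOverUsers`, `vocabulary` and `meanOverAll` are hypothetical names.

```wolfram
(* Hypothetical counts for two speakers *)
mergeCounts = <|
   "A" -> <|"tax" -> 4, "jobs" -> 2|>,
   "B" -> <|"tax" -> 2|>|>;

(* As in the post: the mean is taken over the speakers who used the word *)
meanOverUsers = Merge[Values[mergeCounts], N[Mean[#]] &];

(* Alternative: count a missing word as 0, so the mean is over all speakers *)
vocabulary = Union @@ (Keys /@ Values[mergeCounts]);
meanOverAll = AssociationMap[
   Function[w, N[Mean[Lookup[#, w, 0] & /@ Values[mergeCounts]]]],
   vocabulary];
```

Here "jobs" has mean 2 under the first definition and 1 under the second, so under the second definition speaker "A" would count as using it more than average.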

Individual frequency divided by overall mean frequency:

candidateOverMean = <|# -> 
      With[{cand = 
         allCandidateCounts[#]}, <|# -> cand[#]/meanCounts[#] & /@ 
         Keys[cand]|>] & /@ candidates|>;

Multicolumn[
 Labeled[Framed@WordCloud[candidateOverMean[#], IgnoreCase -> True, 
      ImageSize -> 300], Style[#, "Section"], Top] & /@ candidates, 3]

Unique words: Only show words that none of the other participants used

uniqueKeys = <|# -> 
      FoldList[Complement, Keys[allCandidateCounts[#]], 
        Keys[allCandidateCounts[#]] & /@ 
         Complement[Keys[allCandidateCounts], {#}]][[-1]] & /@ 
    Keys[allCandidateCounts]|>;

Multicolumn[
 Labeled[Framed@WordCloud[KeyTake[allCandidateCounts[#], uniqueKeys[#]], 
      IgnoreCase -> True, ImageSize -> 300], Style[#, "Section"], Top] & /@ candidates, 3]
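The unique-word step removes, for each speaker, every word that appears in any other speaker's vocabulary. A compact way to express the same set difference, sketched on made-up word lists (`toyWords` and `uniqueWords` are hypothetical names), is to give Complement all the other lists at once:

```wolfram
(* Hypothetical vocabularies per speaker *)
toyWords = <|
   "A" -> {"tax", "jobs", "wall"},
   "B" -> {"tax", "iran"},
   "C" -> {"jobs", "iran", "health"}|>;

(* Words of each speaker that occur in none of the other speakers' lists *)
uniqueWords = Association@Table[
    name -> Complement[toyWords[name],
      Sequence @@ Values[KeyDrop[toyWords, name]]],
    {name, Keys[toyWords]}];
```

A speaker whose every word is shared with someone else ends up with an empty list, which would produce an empty word cloud.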

POSTED BY: Alan Joyce
