[Mathematica-vs-R] Text analysis of Trump tweets

Introduction

This post announces the MathematicaVsR at GitHub project "Text analysis of Trump tweets", in which we compare Mathematica and R over text analyses of Twitter messages made by Donald Trump (and his staff) before the USA presidential election in 2016.

This project follows and extends the exposition and analysis of the R-based blog post "Text analysis of Trump's tweets confirms he writes only the (angrier) Android half" by David Robinson at VarianceExplained.org; see [1].

The blog post [1] links to several sources that claim that during the election campaign Donald Trump tweeted from his Android phone and his campaign staff tweeted from an iPhone. The blog post [1] examines this hypothesis quantitatively, using various R packages.

The hypothesis in question is well summarized with the tweet:

Every non-hyperbolic tweet is from iPhone (his staff).
Every hyperbolic tweet is from Android (from him). pic.twitter.com/GWr6D8h5ed
-- Todd Vaziri (@tvaziri) August 6, 2016

This conjecture is fairly well supported by the following mosaic plots, [2]:

[Mosaic plot: sentiment vs. device] [Mosaic plot: device, weekday, and sentiment]

We can see that the Twitter messages from iPhone are much more likely to be neutral, and the ones from Android are much more polarized. As Christian Rudder (one of the founders of OkCupid, a dating website) explains in the chapter "Death by a Thousand Mehs" of the book "Dataclysm", [3], having a polarizing image (online persona) is a very good strategy for engaging an online audience:

[...] And the effect isn't small - being highly polarizing will in fact get you about 70 percent more messages. That means variance allows you to effectively jump several "leagues" up in the dating pecking order - [...]

(The mosaic plots above were made for the Mathematica-part of this project. Mosaic plots and weekday tags are not used in [1].)

Concrete steps

The Mathematica-part of this project does not closely follow the blog post [1]. After ingesting the data provided in [1], the Mathematica-part applies alternative algorithms to support and extend the analysis in [1].

The sections in the R-part notebook correspond to some -- not all -- of the sections in the Mathematica-part.

The following list of steps is for the Mathematica-part.

  1. Data ingestion

    • The blog post [1] shows how to ingest Donald Trump's Twitter messages in R.

    • That can be done in Mathematica too, using the built-in function ServiceConnect, but it is not necessary since [1] provides a link to the ingested data it uses:

      load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))

    • This leads to ingesting an R data frame in the Mathematica-part using RLink.

  2. Adding tags

    • We have to extract device tags for the messages -- each message is associated with one of the tags "Android", "iPad", or "iPhone".

    • Using the message time-stamps, each message is associated with time tags corresponding to the month, hour, weekday, etc. of its creation time.

    • Here is a summary of the data at this stage:

      [Image: summary of the tagged data]

  3. Time series and time related distributions

    • We can make several types of time series plots for general insight and to support the main conjecture.

    • Here is a Mathematica-made plot of the same statistic computed in [1], showing differences in tweet posting behavior:

    [Image: plot of tweet posting behavior]

    • Here are distribution plots of tweets per weekday:

    [Image: distributions of tweets per weekday]

  4. Classification into sentiments and Facebook topics

    • Using the built-in classifiers of Mathematica, each tweet message is associated with a sentiment tag and a Facebook topic tag.

    • In [1] the results of this step are derived in several stages.

    • Here is a mosaic plot for conditional probabilities of devices, topics, and sentiments:

    [Image: mosaic plot of devices, topics, and sentiments]

  5. Device-word association rules

    • Using association rule learning, device tags are associated with words in the tweets.

    • In the Mathematica-part these association rules are not needed for the sentiment analysis, because of the built-in classifiers.

    • The association rule mining is done mostly to support and extend the text analysis in [1] and, of course, for comparison purposes.

    • Here is an example of derived association rules together with their most important measures:

    [Image: example association rules with their measures]
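
The measures commonly reported for association rules can be illustrated with a minimal sketch. The data below is made up, and the helper names support, confidence, and lift are hypothetical. This is not the code of "AprioriAlgorithm.m" or "arules"; it only applies the standard definitions by brute force to a toy list of tagged tweets.

```mathematica
(* Toy "baskets": each tweet is a device tag plus some of its words (made-up data) *)
baskets = {
   {"Android", "crooked", "media"},
   {"Android", "crooked"},
   {"Android", "great"},
   {"iPhone", "join", "great"},
   {"iPhone", "join"},
   {"iPhone", "thank"}};

(* Fraction of baskets that contain all of the given items *)
support[items_List] := 
  Count[baskets, b_ /; SubsetQ[b, items]]/Length[baskets];

(* Fraction of baskets with the antecedent that also have the consequent *)
confidence[ante_List, cons_List] := 
  support[Join[ante, cons]]/support[ante];

(* Ratio of the confidence to the overall frequency of the consequent *)
lift[ante_List, cons_List] := confidence[ante, cons]/support[cons];

(* Measures of the rule {"crooked"} -> {"Android"} *)
{support[{"crooked", "Android"}], 
 confidence[{"crooked"}, {"Android"}], 
 lift[{"crooked"}, {"Android"}]}
(* {1/3, 1, 2} *)
```

A lift above 1 means the word and the device tag occur together more often than they would if they were independent.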

In [1] the sentiments are derived from computed device-word associations, so in [1] the order of steps is 1-2-3-5-4. In Mathematica we do not need steps 3 and 5 in order to get the sentiments in step 4.
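
For the time tags of step 2, here is a minimal sketch. The time stamp is made up, and timeTags is a hypothetical helper name, not a function from the project.

```mathematica
(* Made-up time stamp; the real data has one creation time per tweet *)
created = DateObject[{2016, 8, 6, 14, 30, 0}];

(* timeTags is a hypothetical helper: month, hour, and weekday of a date *)
timeTags[d_DateObject] := 
  AssociationThread[{"Month", "Hour", "Weekday"} -> 
    DateValue[d, {"Month", "Hour", "DayName"}]];

timeTags[created]
(* <|"Month" -> 8, "Hour" -> 14, "Weekday" -> Saturday|> *)
```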

Comparison

Using Mathematica for sentiment analysis is much more direct because of the built-in classifiers.
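
A minimal sketch of what "more direct" can look like, assuming the built-in classifiers named "Sentiment" and "FacebookTopic". The two messages are made up for illustration and are not from the data set; the project notebooks may organize this differently.

```mathematica
(* Made-up messages for illustration; not taken from the data set *)
msgs = {"Thank you Ohio! Join us tomorrow.", 
   "The media is so dishonest. Sad!"};

(* Built-in classifiers; the first call may download classifier data *)
sentiments = Classify["Sentiment", #] & /@ msgs;
topics = Classify["FacebookTopic", #] & /@ msgs;

(* One row per message: text, sentiment tag, topic tag *)
Transpose[{msgs, sentiments, topics}]
```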

The R-based blog post [1] makes heavy use of the "pipeline" operator %>%, a fairly recent addition to R that is both fashionable and convenient. In Mathematica the related operators are Postfix (//), Prefix (@), Infix (~f~), Composition (@*), and RightComposition (/*).
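
To make the correspondence concrete, here is a small sketch on a made-up word list. It writes the same word count with the different operators; the names words and countWords are only for illustration.

```mathematica
words = {"crooked", "great", "crooked", "sad"};

(* Postfix: data flows left to right, much like a %>% chain in R *)
r1 = words // Tally // SortBy[Last] // Reverse;

(* Prefix: the same computation, read right to left *)
r2 = Reverse @ SortBy[Last] @ Tally @ words;

(* RightComposition: build the pipeline first, apply it later *)
countWords = Tally /* SortBy[Last] /* Reverse;
r3 = countWords[words];

(* Composition: the same functions, composed right to left *)
r4 = (Reverse @* SortBy[Last] @* Tally)[words];

(* Infix: a two-argument function written between its arguments *)
r5 = words ~Join~ {"sad"};

{r1 === r2 === r3 === r4, First[r1]}
(* {True, {"crooked", 2}} *)
```

RightComposition is probably the closest analogue of a reusable %>% chain, since the pipeline can be named and applied to different inputs.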

Making the time series plots with the R package "ggplot2" requires making special data frames. I am inclined to think that the Mathematica plotting of time series is more direct, but for this task the data wrangling code in Mathematica and R is fairly comparable.

Generally speaking, the R package "arules" -- used in this project for association rule learning -- is somewhat awkward to use:

  • it is data frame centric, does not work directly with lists of lists, and

  • requires the use of factors.

The Apriori implementation in "arules" is much faster than the one in "AprioriAlgorithm.m" -- "arules" uses a more efficient algorithm implemented in C.

References

[1] David Robinson, "Text analysis of Trump's tweets confirms he writes only the (angrier) Android half", (2016), VarianceExplained.org.

[2] Anton Antonov, "Mosaic plots for data visualization", (2014), MathematicaForPrediction at WordPress.

[3] Christian Rudder, Dataclysm, Crown, 2014. ASIN: B00J1IQUX8.

POSTED BY: Anton Antonov
1 year ago

Another post of yours has been selected for the Staff Picks group, congratulations! We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!

POSTED BY: Moderation Team
1 year ago

Hello @Anton Antonov

I've tried replicating your analysis using only the ServiceConnect framework, without success. I'm interested in extracting the source field of a tweet, which is described on the Twitter API site. Would you happen to know if this field can be extracted with the ServiceConnect framework?

Based on the documentation provided with Mathematica 11 and the links that you shared I've tried to obtain this information to no avail. I was hoping that executing the following code would provide me with at least an idea as to how the Request parameter is handled in Mathematica (hoping that maybe a parameter was not documented), but the results indicate that the source field is not recognized in the ServiceConnect framework (based on the column names shown in the output Dataset).

twitter = ServiceConnect["Twitter", "New"]    
tweets = twitter["TweetList", "Username" -> "realDonaldTrump", 
  "Elements" -> "FullData"]

Have you experimented in such a way with the ServiceConnect framework?

POSTED BY: Isaac Ayala Lozano
5 months ago

Even if the source parameter is not supported, you can quickly create your own utility to get what you want, e.g.

In[1]:= twitter = ServiceConnect["Twitter", "New"]
Out[1]= ServiceObject["Twitter", 
 "ID" -> "connection-29f751249df0b3e87a365f8c21a6b31f"]

In[20]:= url = 
  "https://api.twitter.com/1.1/statuses/user_timeline.json";

In[21]:= userid = Lookup[First@twitter["UserData"], "ID"];

(*build url with required parameters*)
In[30]:= urlstring = 
  URLBuild[{url}, {"user_id" -> ToString@userid, 
    "screen_name" -> "realDonaldTrump", "count" -> 5}];

(*fetch data using authentication framework*)
In[31]:= result = OAuthClient`oauthdata[twitter, urlstring];

(*post processing result*)
In[33]:= data = ImportString[result[[2]], "RawJSON"];

In[34]:= #["source"] & /@ data
Out[34]= {
"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"
}
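
A possible next step, sketched under assumptions: the device tag can be pulled out of such source strings with StringCases. The helper name deviceTag is hypothetical, and the Android string below is a sample written in the same shape as the output above, not an actual API result.

```mathematica
(* Sample strings in the shape of the "source" field shown above *)
sources = {
   "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
   "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"};

(* deviceTag is a hypothetical helper: take the letters between *)
(* "Twitter for " and the closing tag, or Missing if there is no match *)
deviceTag[s_String] := 
  First[StringCases[s, 
    "Twitter for " ~~ d : (LetterCharacter ..) ~~ "</a>" :> d], 
   Missing["NotAvailable"]];

deviceTag /@ sources
(* {"iPhone", "Android"} *)
```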
POSTED BY: Damanjit Singh
5 months ago

Dear Daman,

Thank you for your reply to Isaac!

Isaac, I was on a business trip and had limited time to deal with side projects... I hope Daman's answer works for you.

--Anton

POSTED BY: Anton Antonov
5 months ago

Thank you very much Daman!

Working with APIs is a first for me so this code is perfect for me to begin to understand the ideas behind creating the functionality that I need.

Once again I'm impressed by how short the code needs to be in Mathematica to get useful results.

POSTED BY: Isaac Ayala Lozano
5 months ago
