Trying to analyze discussions around Covid using Twitter data

Posted 4 years ago

Hi! A little bit about my idea and what I could figure out with my basic Mathematica knowledge. Hopefully you guys can let me know other solutions for the objectives I want to achieve.

Basically, I would like to check which words or topics are most discussed in my country's Twittosphere. I'm really interested in this data because I enjoy doing SciComm, and data about how the public interacts on the Internet will help me build better proposals whenever I have to do SciComm expos. My country is also interesting because we're known for not investing much in science. What reputation do scientists have in a not-so-scientific country? That would also be interesting to know.

To start with this, I picked up 7 search words and hashtags I know people in my country are using from what I could extract from the daily TT.

So, my first idea is to apply wordclouds on imported Twitter data.

I tried to use my Mathematica v.11.2.0 on my Ubuntu computer, but sadly, I don't know why it can't process any of the Twitter code I'm writing in there. So, I decided to start working with a notebook on the Cloud. This is what I got using a script I found somewhere:

twitter = ServiceConnect["Twitter"]
result = twitter["TweetSearch", "Query" -> "#CoronavirusEnPeru", MaxItems -> 100];
WordCloud@Flatten[Normal[StringSplit[#["Text"]] & /@ result]]

This resulted in a nice WordCloud, but it lacks filtering. The tweets I'm trying to analyze are all written in Spanish, so, using a list of Spanish stopwords, I created a list stepsp.

My goal is to remove all Spanish stopwords to get a clearer picture. I also added some Twitter jargon that might be noisy.

This is what I tried to use to remove Spanish stopwords, using my list stepsp:

DeleteCases[Normal[result[[All,"Text"]]], Alternatives @@  stepsp] 

And this is where I'm stuck right now. It seems that DeleteCases isn't actually doing anything. I tried to produce a wordcloud from the result, but that computation seems to exceed what I'm able to do with the Cloud.
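One likely reason DeleteCases appears to do nothing: it compares each element of the list against the pattern, and here each element is a whole tweet, so no element is ever equal to a single stopword. A minimal sketch, assuming two made-up tweets and a shortened stepsp (both hypothetical stand-ins for the real data), shows the difference once the text is split into lowercase words:

```wolfram
(* Hypothetical stand-ins for the real data: two tweets and a short stopword list *)
tweets = {"El virus en el Peru", "La ciencia y la salud"};
stepsp = {"el", "la", "en", "y"};

(* Each element is a whole tweet, so none equals a stopword: nothing is removed *)
DeleteCases[tweets, Alternatives @@ stepsp]

(* Split into lowercase words first; now each element is a single word that can match *)
words = Flatten[StringSplit[ToLowerCase[tweets]]];
DeleteCases[words, Alternatives @@ stepsp]
(* {"virus", "peru", "ciencia", "salud"} *)
```

The filtered word list can then be passed to WordCloud in the same way as the unfiltered one.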

These are my questions:

  1. Is DeleteCases a good way to remove stopwords? Why is that function not deleting what I want to delete?
  2. Is there a way to obtain only tweets from a certain country? I tried using GeoLocation, but I don't know if this is the way to go.
  3. How should I proceed with the Mathematica I have installed on my computer? This is what I get whenever I want to process anything:

    $CharacterEncoding: "The byte sequence {240} could not be interpreted as a character in the UTF-8 character encoding."
    

I tried the same code I'm using in the Cloud notebook on my own computer, but it doesn't work. And it seems that I won't be able to complete these tasks in the Cloud, since I might use up all the memory I have available.

Thanks for any help you could give me!

POSTED BY: Camila Castillo
4 Replies

Hola Camila, I think I found a way to remove these tags from the Twitter data before you process it. I called the Twitter string t. Then this seemed to work; however, I only tried it on a short segment of your Twitter string, so it is possible there are other patterns that also need to be excluded. Good luck.

StringDelete[t, Shortest["RT@" ~~ __ ~~ ":"]]
POSTED BY: Nathan Shpritz

Hola to you, Nathan. I can't thank you enough for your help. I finally managed to make the wordclouds I've been looking for, and they're looking great. I will be retrieving info from here to see what is being discussed these days about the pandemic and scientific communication.

Thank you again!

POSTED BY: Camila Castillo

I think this might work for you - I do not have a Twitter account and I am hoping to see what you come up with.

First, I copied your stopwords list, and converted it into a string and then into an association:

To get it into String format:

stepspStep1 = ToString /@ stepsp;
stepspAsString = StringJoin[Riffle[stepspStep1, " "]]

Then create an association as a WordCount:

stepsspAsWordcount = WordCounts[stepspAsString]

Then I took a piece of your Twitter data and created a WordCount association as well:

shortTwitter = WordCounts[t]

And then just took the complement of the keys from the Twitter with the stop words:

shortList = KeyComplement[{shortTwitter, stepsspAsWordcount}]

You can create a WordCloud directly from the shortList:

WordCloud[shortList]

I did notice that one should probably move all letters to lowercase because "Todo" slipped into the WordCloud.
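A minimal sketch of the lowercase step, assuming a made-up sample string for t and a shortened stepsp (both hypothetical). It also swaps KeyComplement for KeyDrop, which takes the stopword list directly, so the stopwords do not have to be turned into a WordCounts association first:

```wolfram
(* Hypothetical sample text standing in for the Twitter string t *)
t = "Todo el mundo habla del virus y todo cambia";
stepsp = {"todo", "el", "y", "del"};

(* Lowercase before counting so that "Todo" and "todo" become the same key *)
shortTwitter = WordCounts[ToLowerCase[t]];

(* Drop every key that appears in the (lowercased) stopword list *)
shortList = KeyDrop[shortTwitter, ToLowerCase[stepsp]]

(* The remaining association of word -> count can go straight into WordCloud *)
WordCloud[shortList]
```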

Good luck.

POSTED BY: Nathan Shpritz

Hi Nathan!

First of all, thank you so much for your help. It worked perfectly! I also followed your suggestion about lower cases and it went great. Thank you again.

As a last touch, however, I would like to delete all "@<<STRING>>" usernames in my search, because they are just noisy in what I would like to search for. Searching a bit, I came up with:

CleanttAsString = StringDelete[ttAsString,"@"~~__~~" "]

Here, ttAsString is the variable containing the Twitter data. However, it seems that I'm just erasing all of my data. Any idea of what might be happening? I'm still trying to get my head around string patterns.
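A likely cause is that __ is greedy in string patterns: it matches as many characters as it can, so the match runs from the first "@" to the last space in the whole string. The sketch below uses a made-up sample string (hypothetical, not the real data) and shows one possible alternative that stops each match at the first whitespace character:

```wolfram
(* Hypothetical sample standing in for ttAsString *)
ttAsString = "hola @usuario1 como estas @usuario2 gracias a todos";

(* Greedy: __ extends from the first "@" to the last space in the string *)
StringDelete[ttAsString, "@" ~~ __ ~~ " "]
(* "hola todos" *)

(* One option: let the match consume only non-whitespace characters after "@" *)
cleanTtAsString = StringDelete[ttAsString, "@" ~~ (Except[WhitespaceCharacter] ..)]
```

The second form removes each username separately and leaves the words between them in place. It can leave double spaces behind, which do not matter once the text is split into words.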

I'm attaching a new Notebook with the new improvements. Thanks again for the help!

POSTED BY: Camila Castillo