Message Boards Message Boards

[WSC19] Analysing Global Emoji Usage

Posted 11 days ago
601 Views
|
6 Replies
|
5 Total Likes
|

Introduction:

Emojis over time have become an important form of expressing emotions, especially in the social media world. Understanding how people use emojis and how the usage varies across countries can help us understand human behaviour better. The main aim of this project was to explore which emojis are dominantly popular in a specific geographic area.

Further, a short exploration was made into how different socio-economic factors may affect emoji usage.

Methodology:


All the data was obtained in the form of geolocation tagged tweets from Twitter. Further, the emojis were isolated from each tweet (also based on country), and their usage was analysed.

Initial Attempts

Initially I started off by developing a pseudo dataset, forming a list of countries and their respective emoji usage frequency, randomly generated. This was done as an attempt to better understand possible approaches to solving the problem.

In the beginning, the goal of the study was to explore the emoji usage patterns for a few countries, say the ‘top 10 twitter using’ countries. However, being able to extract emojis from tweets is a more powerful tool than I originally anticipated. So, I decided to do it for all the countries that the Wolfram Language has in its repository.

Extracting Data from Twitter

I used ServiceConnect, a built in function of the Wolfram Language, to connect to Twitter, and then used the “TweetSearch” attribute to extract Twitter data from 226 countries. Unfortunately, around 22 countries’ data was not available, and so are excluded from the current study.

twitter=ServiceConnect[“Twitter”]


twitter["TweetSearch", "Query" -> "", GeoLocation -> #, 

 MaxItems -> 2000 ]

countries = CountryData[];


Block[{results, filename},

 results = 

  twitter["TweetSearch", "Query" -> "", GeoLocation -> #, 

   MaxItems -> 2000 ];

 filename = 

  FileNameJoin[{"/Users/anweshadas/Desktop/Wolfram emoji \

project/Extracting the Emoji/", "Results_" <> #["Name"] <> ".m"}];

 Export[filename, results];

 Pause[120]; ]

All the data that was collected was stored on a local folder on the computer.

Extracting Emoji from Tweets

Extracting emojis from a tweet was by far the most difficult part of the problem. The problem being that all the tweets on Twitter are encoded in UTF-16, whereas the Wolfram Language FromCharacterCode function uses UTF-8.

As a first step, I imported the unicodes for all the emojis into the Mathematica notebook.

string = Import[

   "http://unicode.org/Public/emoji/12.0/emoji-data.txt"];


goodLines = 

  Select[StringTrim /@ StringSplit[string, "\n"] /. "" -> Nothing, 

   StringTake[#, 1] =!= "#" &];


codes = First /@ (StringSplit[#] & /@ goodLines);


allCodesUNICOD = 

  Union@Select[

    With[{d = FromDigits[#, 16] & /@ Flatten[StringSplit[#, "."] /. "" -> Nothing]}, 

        If[Length[d] == 1, d, Range @@ d[[{1, -1}]]]] & /@ codes // 

     Flatten, # >= 9728 &];

Next, I imported the twitter data (which was stored on the computer in the previous step) into the notebook.

filenames = FileNames["*.m", NotebookDirectory[]];



missing = {"Canada", "Chad", "Democratic Republic of the Congo", 

   "Egypt", "Falkland Islands", "Libya", "Mauritania", "Mongolia", 

   "Myanmar", "Nicaragua", "Niger", "Norfolk Island", "Romania", 

   "Somalia", "Svalbard", "Sweden", "Syria", "Tonga", "Turkmenistan", 

   "Tuvalu", "Uzbekistan", "Zambia"};



missingEntity = Interpreter["Country"][#] & /@ missing; 

cData = Complement[ StringCases[filenames, ___ ~~ "/Results_" ~~ x__ ~~ ".m" :> x] // 

    Flatten, missing];



newfilenames = filenames //  Select[MemberQ[cData, StringCases[#, ___ ~~ "/Results_" ~~ x__ ~~ ".m" :> x] // 

       First] &];



alldata = Import[#] & /@ newfilenames;

Next, I created a list of all the countries for which the data was collected.

countries = CountryData[];


newcountries = Select[countries, ! MemberQ[missingEntity, #] &];

Then I extracted the text part from the tweets, and converted them into unicode.

alltext = alldata[[#]][All, "Text"] & /@ Range[alldata // Length];


allcodes = ToCharacterCode[alltext[[#]] // Normal, "Unicode"] & /@ 

   Range[alltext // Length];

Then threaded them as an association.

allthread = AssociationThread[newcountries -> allcodes];

The final step to extracting the emojis was converting all of the UTF-16 codes into UTF-8, which allowed the Wolfram Language to interpret them.

toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff :=

 (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4.



datacleaned = Map[toCodePoint[#] /. _toCodePoint -> First@# & /@ Partition[#, 2, 1] &, allthread, {2}] /. r_Real :> Floor@r;

The output of this step was a list of all the emojis used in all the Tweets for each country.

justEmojis = Cases[#, Alternatives @@ allCodesUNICOD, {2}] & /@ datacleaned;

Exploration: GDP and Literacy Fraction

I feel that emojis are a very powerful outlet towards understanding human behavior. Hence, I wondered if socio economic conditions of a country might influence their emoji usage. In this project, I took the case of two very common social indicators- GDP (Gross Domestic Product) and Literacy Fraction.

Results


The next step was to get a list of number of emojis present in tweets per country.

sortedCountries = Length /@ justEmojis // Normal;

In order to have a better understanding of the data, I decided to calculate the emoji usage density (number of emojis per tweet) for each country.

a = sortedCountries // Values;


theactualnumberoftweets = 

  AssociationThread[newcountries -> Length /@ alldata] // Normal;


b = theactualnumberoftweets // Values;


theemojiratio = 

  Table[N[Part[a, n]/Part[b, n]], {n, Length@newcountries}];

I was also curious to find out that on an average what percentage of tweets contain an emoji. Results suggest that on an average, 63% of tweets use emojis. Next, I plotted a weighted map for emoji usage density.

GeoRegionValuePlot[AssociationThread[newcountries -> theemojiratio], 

 ImageSize -> 500]

enter image description here

Density Plots by Continent

After this, I decided to have an emoji density map for each continent, in order to understand emoji usage frequency in specific regions.

Asia

AsiaGeoRegion = Part[amapfortheemojiratio, #] & /@ Position[newcountries, #] & /@ 

    Entity["GeographicRegion", "Asia"][EntityProperty["GeographicRegion", "Countries"]] // Flatten;



GeoRegionValuePlot[AsiaGeoRegion, ImageSize -> 500]

enter image description here

Europe

EuropeGeoRegion = Part[amapfortheemojiratio, #] & /@ Position[newcountries, #] & /@ Entity["GeographicRegion", "Europe"][ EntityProperty["GeographicRegion", "Countries"]] // Flatten;



GeoRegionValuePlot[EuropeGeoRegion, ImageSize -> 500]

enter image description here

Africa

AfricaGeoRegion = Part[amapfortheemojiratio, #] & /@ Position[newcountries, #] & /@ 

    Entity["GeographicRegion", "Africa"][EntityProperty["GeographicRegion", "Countries"]] // Flatten;



GeoRegionValuePlot[AfricaGeoRegion, ImageSize -> 500]

enter image description here

North America

NorthAmericaGeoRegion = Part[amapfortheemojiratio, #] & /@ Position[newcountries, #] & /@ 

    Entity["GeographicRegion", "NorthAmerica"][EntityProperty["GeographicRegion", "Countries"]] // Flatten;



GeoRegionValuePlot[NorthAmericaGeoRegion, ImageSize -> 500]

enter image description here

South America

SouthAmericaGeoRegion = 

 Part[amapfortheemojiratio, #] & /@ Position[newcountries, #] & /@ 

 Entity["GeographicRegion", "South America"][EntityProperty["GeographicRegion", "Countries"]] // Flatten;



GeoRegionValuePlot[SouthAmericaGeoRegion, ImageSize -> 500]

enter image description here

Most Common Emojis by Country

One of the most important parts of the project was to investigate the most common emoji for each country.

cuteEmojis = KeyMap[FromCharacterCode, #] & /@  ReverseSort /@ Counts /@ justEmojis;

ds = Dataset@cuteEmojis;



ds[1 ;;, Association@MaximalBy[Normal@#, Last] &]

enter image description here

Most Common Emoji Globally

To illustrate the global frequency of specific emoji usage, I plotted a bar chart for the 20 most common emojis.

BarChart[Last /@ Take[Reverse[SortBy[MostCommonEmojis, Last]], 20], 

 ChartLabels -> First /@ Take[Reverse[SortBy[MostCommonEmojis, Last]], 20] , 

 ImageSize -> 750]

enter image description here

GDP and Literacy Fraction

As mentioned in the methods section, as an exploratory question for the project was to try and see if I could find any correlation between the GDP and emoji usage and between literacy ratio and emoji usage.

Unfortunately, no clear trend was observed for the GDP vs emoji usage plot. However, for the literacy fraction vs emoji usage, a positive trend was observed, i.e. emoji usage goes up as the literacy fraction goes up.

GDP vs EmojiRatio

gdp = CountryData[#, "GDP"] & /@ newcountries;



listplotdata = SortBy[Select[Thread[gdp -> theemojiratio], ! MissingQ[#[[1]]] &] //

     Normal, Keys];



gdpdataplotted = ListPlot[List @@@ listplotdata, AxesLabel -> {"GDP", "emojiratio"}, 

  ImageSize -> 500]

enter image description here

Literacy Fraction vs EmojiRatio

literacyfraction = CountryData[#, "LiteracyFraction"] & /@ newcountries;



literaryfractionsorted = SortBy[Select[Thread[literacyfraction -> theemojiratio], ! 

       MissingQ[#[[2]]] &] // Normal, Keys];



literacyfractionplot = ListPlot[List @@@ literaryfractionsorted, 

  AxesLabel -> {"literacyfraction", "theemojiratio"}, ImageSize -> 500]

enter image description here

Conclusion:


In this project, around 220 000 tweets from 226 countries was used. The ‘tears of joy’ emoji was found to be the most popular globally, in addition to being the most used for 104 countries. This was followed by the ‘crying loudly' emoji !

Future Work


This work can of course be improved with more consistent data. In addition to this, one interesting extension of the project could be to explore how the first/official language of a country affects its emoji usage patterns. A possible approach is to try and find a correlation between alphabet length (for eg. 26 for English) and emoji usage density.

Acknowledgement:


A big shout out to my mentor who made this project possible—Rory Foulger. Also, a big ‘Thank You’ to mentors Kyle Keane, Philip Maymin, Christian Pasquel and Mads Bahrami—who too were an integral part of this project.

References:


6 Replies

Wow this is amazing! I never knew you could learn so much from emojis

Posted 10 days ago

Amazing project! :)

Posted 9 days ago

This is amazing! Good job!

Posted 9 days ago

Thanks!

Fetching data process :))

Posted 4 days ago

Your project is great, Anwesha! :)

Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract