[WSSA16] Classify Users by Internet Public Information

Posted 8 years ago

User Clustering by their Social Behaviour

My main goal in this project was to classify internet users using information publicly available on social sites. Among the many alternatives, I chose Reddit because I was able to find a rich database: the whole set of comments posted during May 2015.

The main steps of my project were:

  • Learning how to work with an SQL database in Mathematica
  • Collecting information about random users
  • Analysing the relevance of the subreddits present in the data
  • Creating a user feature vector
  • Clustering the users through their feature representation

Learning how to use the Database

To start working on the database, I first needed to know how to open SQLite files in Mathematica. I then checked the database description in order to manipulate the data correctly. As it was my first time, I wanted to start easy to get an idea of what could be done. For that I chose one random person from the dataset and tried to collect information about him (or her). I selected up to 1000 rows of the columns "subreddit" and "score", keeping only the rows where the value in the column "author" was "WyaOfWade" (a random user's name).
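The post does not show how the connection was opened, so the following is only a minimal sketch of one way to do it with DatabaseLink. The file name is hypothetical, and using the SQLite JDBC driver bundled with DatabaseLink is an assumption about the setup, not the author's confirmed method.

```mathematica
Needs["DatabaseLink`"]

(* Hypothetical file name; replace it with the path to your own SQLite file *)
dbFile = "reddit-comments-2015-05.sqlite";

(* Open the file through the SQLite JDBC driver that ships with DatabaseLink *)
database = OpenSQLConnection[JDBC["SQLite", dbFile]];

(* Inspect the schema: the table names, then the columns of one table *)
SQLTableNames[database]
SQLColumnNames[database, "May2015"]
```

When the work is finished, the connection can be released with CloseSQLConnection[database].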

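(*SQLSelect and SQLColumn come from DatabaseLink; database is an already open connection*)
Needs["DatabaseLink`"]
commentsScore=SQLSelect[database,"May2015",{"subreddit","score"},
SQLColumn[{"May2015","author"}]=="WyaOfWade",
"MaxRows"->1000];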

Then I grouped the results by the first element ("subreddit"), kept the last one ("score"), and computed the length of each group to see how many comments the user posted in a given subreddit.

GroupBy[commentsScore, First -> Last, Length]

The result in Column form is

nba->306
nfl->2
CoDCompetitive->32
hiphopheads->17
Boxing->2
GlobalOffensive->6
headphones->8
food->1
pcmasterrace->1
leagueoflegends->4
malefashionadvice->4
DotA2->3
Music->2
OpTicGaming->10
leakthreads->2
AskReddit->1
todayilearned->1
Games->1

So now we have some information about a random user and can, for example, make a histogram for a more presentable form. Here are the top 3 subreddits where he/she is commenting.

[Image: histogram of this user's top 3 subreddits]
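To make the grouping step concrete, here is a small sketch on made-up rows in the same {subreddit, score} shape as the query result. The names toyRows, toyCounts and toyTop are illustrative only, and the BarChart call is just one possible way to draw the histogram.

```mathematica
(* Toy rows in the same {subreddit, score} shape as the query result *)
toyRows = {{"nba", 12}, {"nfl", 3}, {"nba", -1}, {"food", 5}, {"nba", 7}, {"food", 2}};

(* Group by subreddit, keep the scores, and count how many there are per group *)
toyCounts = GroupBy[toyRows, First -> Last, Length]
(* <|"nba" -> 3, "nfl" -> 1, "food" -> 2|> *)

(* Keep the two most active subreddits, largest first *)
toyTop = TakeLargest[toyCounts, 2]
(* <|"nba" -> 3, "food" -> 2|> *)

(* One possible chart of the result *)
BarChart[toyTop, ChartLabels -> Keys[toyTop]]
```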

Starting with Statistics

Now we can gather statistics about more than one user, for example the first 10,000 users in the database. To reduce the noise in the data, we filter by number of comments: for each user we keep only the subreddits in which he or she posted more than 20 comments.

minLength=20;
commentPerSub=Map[Select[GroupBy[#,First->Last,Length],#>minLength&]&, data];

Short[commentPerSub,2]
<|rx109-><|newsokur->222,BakaNewsJP->37|>,WyaOfWade-><|nba->306,CoDCompetitive->32|>,
   Wicked_Truth-><|politics->156|>,jesse9o3-><|AskReddit->54,worldnews->275,soccer->28|>,
   <<7516>>,Zandock-><|fireemblem->57|>,RandomRem-><||>,Op69dong-><||>|>

This way we find ourselves with 4468 users. Here is the distribution of the number of subreddits each user commented in. Most users only have significant activity (more than 20 comments) in one subreddit.

[Image: distribution of the number of subreddits each user commented in]
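The filtering step can be tried on a toy input. This sketch assumes that data is an association from user name to a list of {subreddit, score} rows, which the post does not show explicitly. Dropping users left with no subreddit is likewise an assumption about how the user count was reduced.

```mathematica
(* Assumed shape of the input: user name -> list of {subreddit, score} rows *)
toyData = <|
   "alice" -> {{"nba", 3}, {"nba", 1}, {"nba", 7}, {"food", 2}},
   "bob" -> {{"Music", 1}, {"Music", 4}},
   "carol" -> {{"nfl", 1}, {"nfl", 2}, {"nfl", 5}, {"nba", 1}, {"nba", 2}, {"nba", 9}}
   |>;
toyMinLength = 2;

(* Per user: count comments per subreddit, keep counts above the minimum *)
toyCommentPerSub = Map[Select[GroupBy[#, First -> Last, Length], # > toyMinLength &] &, toyData]
(* <|"alice" -> <|"nba" -> 3|>, "bob" -> <||>, "carol" -> <|"nfl" -> 3, "nba" -> 3|>|> *)

(* Assumption: users with no remaining subreddit are dropped *)
toyActive = Select[toyCommentPerSub, Length[#] > 0 &];
Keys[toyActive]
(* {"alice", "carol"} *)

(* How many users are active in 1, 2, ... subreddits *)
toyDistribution = Counts[Length /@ Values[toyActive]]
(* <|1 -> 1, 2 -> 1|> *)
```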

Subreddit analysis

It's time to start the Subreddit analysis. I first got the list of all subs (1812 of them are present in my data).

allSubs = Merge[Values[commentPerSub], Total];

Length[allSubs]
1812

Here are the first five by number of comments

TakeLargest[allSubs, 5]

<|"AskReddit" -> 85288, "nba" -> 48097, "nfl" -> 43888, 
 "leagueoflegends" -> 27483, "hockey" -> 18535|>

We can see that AskReddit is a very popular sub, so users commenting on it are probably not strongly correlated by interests. Therefore I decided to drop it from the list. Now I can have a look at the plot of each subreddit vs its number of comments.

[Image: plot of each subreddit against its number of comments]
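The code that removes AskReddit is not shown in the post. A minimal sketch of the merge and of one possible way to drop a key, on toy data with illustrative names, could look like this. KeyDrop is my choice here, not necessarily what the author used.

```mathematica
(* Toy per-user comment counts *)
toyPerUser = {<|"AskReddit" -> 50, "nba" -> 30|>, <|"AskReddit" -> 40|>,
   <|"nba" -> 25, "hockey" -> 21|>};

(* Sum the counts of every subreddit over all users *)
toyAllSubs = Merge[toyPerUser, Total]
(* <|"AskReddit" -> 90, "nba" -> 55, "hockey" -> 21|> *)

TakeLargest[toyAllSubs, 2]
(* <|"AskReddit" -> 90, "nba" -> 55|> *)

(* Drop the overly general subreddit from the totals and from each user *)
toyAllSubsFiltered = KeyDrop[toyAllSubs, "AskReddit"]
(* <|"nba" -> 55, "hockey" -> 21|> *)
toyPerUserFiltered = KeyDrop[#, "AskReddit"] & /@ toyPerUser
(* {<|"nba" -> 30|>, <||>, <|"nba" -> 25, "hockey" -> 21|>} *)
```

Dropping the key from both places keeps the per-user associations and the list of subreddits consistent with each other.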

User representation

Now I have enough information about subreddits and users to create the user vectors that we will use for user classification. So let's create an empty user vector with zero comments for each sub, then fill in the values for each user.

(*Create an empty user vector with zero comments for each sub*)
userVectorEmpty = Association[Thread[Keys[allSubs] -> 0]];

Short[userVectorEmpty]
<|newsokur->0,BakaNewsJP->0,nba->0,<<1806>>,shorthairedhotties->0,uwaterloo->0|>

(*Replace values for each user into the empty vector*)
extendedCommentPerSub = Join[userVectorEmpty, #] & /@ Values[commentPerSub];

The new association has a value for each subreddit:

First[commentPerSub]
Short[First[extendedCommentPerSub]]
<|newsokur->222,BakaNewsJP->37|>
<|newsokur->222,BakaNewsJP->37,nba->0,<<1806>>,shorthairedhotties->0,uwaterloo->0|>

(*Fill the empty user vector for each user and remove the subreddit names*)
userVectors = Values[Join[userVectorEmpty, #]] & /@ commentPerSub;
(*Save the names of the users*)
userNames = Keys[userVectors];
(*Remove the user keys as well; userVectors is now just a normal matrix*)
userVectors = Values[userVectors];
(*Normalize each user vector by dividing each entry by the total number of comments in that subreddit*)
userVectors = Transpose[Transpose[userVectors]/Values[allSubs]];

This is a plot of all the user vectors.

[Image: plot of all the user vectors]
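The construction of the user matrix is easier to follow on a two-user, two-subreddit example. All names below are illustrative and the numbers are made up.

```mathematica
(* Toy totals per subreddit and toy per-user counts *)
toySubs = <|"nba" -> 10, "food" -> 5|>;
toyUsers = <|"alice" -> <|"nba" -> 4|>, "bob" -> <|"nba" -> 6, "food" -> 5|>|>;

(* Zero vector with one slot per subreddit *)
toyEmpty = Association[Thread[Keys[toySubs] -> 0]];

(* Join overwrites the zeros with each user's counts; Values strips the keys *)
toyVectors = Values[Values[Join[toyEmpty, #]] & /@ toyUsers]
(* {{4, 0}, {6, 5}} *)

(* Divide every column (subreddit) by that subreddit's total *)
toyNormalized = Transpose[Transpose[toyVectors]/Values[toySubs]]
(* {{2/5, 0}, {3/5, 1}} *)
```

After this normalization each column sums to one, so a small subreddit weighs as much as a large one when users are compared.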

User classification

We want to build a similarity function to identify users with common interests. Here it is the cosine similarity between two user vectors:

similarity[v1_,v2_]:=v1.v2/(Norm[v1]Norm[v2])

The norms in the denominator are computed separately for efficiency

norms=Norm/@N[userVectors];
normMatrix=Transpose[{norms}].{norms};

This builds the similarity matrix

similarityMatrix=N[userVectors].Transpose[N[userVectors]]/normMatrix;
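As a quick check that the matrix form agrees with the pairwise definition, here is a sketch on three toy vectors. The names are illustrative. Note that a user whose vector is all zeros would give a zero norm and a division by zero, so such users need to be removed beforehand.

```mathematica
similarity[v1_, v2_] := v1.v2/(Norm[v1] Norm[v2])

(* Users 1 and 2 point in the same direction; user 3 is orthogonal to both *)
toyVectors = N[{{1, 0}, {2, 0}, {0, 3}}];

(* Outer product of the norms gives every denominator at once *)
toyNorms = Norm /@ toyVectors;
toyNormMatrix = Transpose[{toyNorms}].{toyNorms};

toySimilarityMatrix = toyVectors.Transpose[toyVectors]/toyNormMatrix
(* {{1., 1., 0.}, {1., 1., 0.}, {0., 0., 1.}} *)

(* The same entry computed with the pairwise function *)
similarity[toyVectors[[1]], toyVectors[[2]]]
(* 1. *)
```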

I want all the values below a certain threshold to be zero, i.e. no connection between the users. I also subtract the identity matrix so that the diagonal becomes zero and users are not connected with themselves

similarityThreshold=.75;    
m2=Threshold[similarityMatrix-IdentityMatrix[Length[userVectors]],similarityThreshold];

All the remaining nonzero entries are set to one

m3=Ceiling[m2];
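The two steps above can be followed on a small hand-written similarity matrix. The values and names are made up for illustration.

```mathematica
(* Toy similarity matrix: only users 1 and 2 are close to each other *)
toySimilarity = {{1., 0.9, 0.2}, {0.9, 1., 0.5}, {0.2, 0.5, 1.}};
toyThreshold = 0.75;

(* Zero the diagonal, then zero every entry below the threshold *)
toyM2 = Threshold[toySimilarity - IdentityMatrix[3], toyThreshold];

(* Round the surviving entries up to 1 to get a 0/1 adjacency matrix *)
toyM3 = Ceiling[toyM2]
(* {{0, 1, 0}, {1, 0, 0}, {0, 0, 0}} *)
```

Ceiling gives exactly 0 or 1 here because the similarities of vectors with non-negative entries lie between 0 and 1.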

Now the graph is built from the adjacency matrix

fullGraph=AdjacencyGraph[m3]

[Image: graph built from the adjacency matrix]

In this graph there are 1625 connected components (groups of users not linked to each other)

graphComponents=ConnectedGraphComponents[fullGraph];
Length[graphComponents]
1625

If I remove the components with only one element (i.e. the clusters of only one user), I am down to 592

graphComponents=Select[graphComponents,VertexCount[#]>1&];
Length[graphComponents]
592

Each connected component now represents a group of users with similar commenting patterns, and hopefully similar interests. Here are five of them

Grid[{#,commentPerSub[[VertexList[#]]]}&/@RandomSample[graphComponents,5]]

[Image: grid of five sampled components with the comment counts of their users]

As we can see, all these groups of users have high activity in the same subreddits.
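The clustering step can be reproduced on a four-user toy graph. The adjacency matrix and the user names are made up, and looking up names by vertex index assumes that the vertices are numbered in the same order as the rows of the user matrix.

```mathematica
(* Users 1 and 2 are linked; users 3 and 4 are isolated *)
toyAdj = {{0, 1, 0, 0}, {1, 0, 0, 0}, {0, 0, 0, 0}, {0, 0, 0, 0}};
toyNames = {"alice", "bob", "carol", "dave"};

toyGraph = AdjacencyGraph[toyAdj];
toyComponents = ConnectedGraphComponents[toyGraph];
Length[toyComponents]
(* 3 *)

(* Keep only the clusters with more than one user *)
toyClusters = Select[toyComponents, VertexCount[#] > 1 &];

(* Vertex indices of each cluster, then the corresponding user names *)
toyMembers = Sort /@ VertexList /@ toyClusters
(* {{1, 2}} *)
toyNames[[#]] & /@ toyMembers
(* {{"alice", "bob"}} *)
```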

POSTED BY: Armen Barseghyan
2 Replies

Very interesting!

POSTED BY: Kristina Akopyan

Thanks a lot, yours was too.

POSTED BY: Armen Barseghyan