Message Boards

WOLFRAM COMMUNITY

7098 Views

2 Replies

5 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

[WSSA16] Classify Users by Internet Public Information

Armen Barseghyan

Armen Barseghyan, Ayb School

Posted 8 years ago

User Clustering by their Social Behaviour My main goal in this project was to classify internet users using information publicly available on social sites. Among the many alternatives, I chose Reddit because I was able to find a rich database. Databasethe whole set of comments posted during May of 2015. The main steps of my project have been Learning how to work with an SQL database in Mathematica Collecting information about random users Analysing the relevance of the Subreddit present in the data Creating a user feature vector Clustering the users through their feature representation Learning how to use the Database To start working on the database, I first checked the database description in order to correctly manipulate the data. To do that I need to know how to open SQLITE files in Mathematica. We must check the database description in order to correctly manipulate the data. As it was my first time, I wanted to start easy to get an idea of what could be done. For that I chose one random person from the dataset, and then tried to collect information about him (or her). I got up to 1000 rows and the columns "subreddit" and "score", but only if the value in the column "author" was "WyaOfWade" (a random user's name). subreddit.commentsScore=SQLSelect[database,"May2015",{"subreddit","score"}, SQLColumn[{"May2015","author"}]=="WyaOfWade", "MaxRows"->1000]; Then I grouped the results by the first element ("subreddit") returning the last one ("score"), and computing the length of the vector to see how many comments are on a given GroupBy[commentsScore, First -> Last, Length] The result in Column form is nba->306 nfl->2 CoDCompetitive->32 hiphopheads->17 Boxing->2 GlobalOffensive->6 headphones->8 food->1 pcmasterrace->1 leagueoflegends->4 malefashionadvice->4 DotA2->3 Music->2 OpTicGaming->10 leakthreads->2 AskReddit->1 todayilearned->1 Games->1 So now we have some information about a random user and can for example make a histogram for a more presentable form. Here are the the top 3 subreddits where he/she is commenting. Starting with Statistics Now we can gather statistics about more than one user, for example the first 10,000 users in the database. To reduce the noise in the data, we filter them by quantity of comments and take all users that have more than 20 comments in some subreddits. minLength=20; commentPerSub=Map[Select[GroupBy[#,First->Last,Length],#>minLength&]&, data]; Short[commentPerSub,2] <\|rx109-><\|newsokur->222,BakaNewsJP->37\|>,WyaOfWade-><\|nba->306,CoDCompetitive->32\|>, Wicked_Truth-><\|politics->156\|>,jesse9o3-><\|AskReddit->54,worldnews->275,soccer->28\|>, <<7516>>,Zandock-><\|fireemblem->57\|>,RandomRem-><\|\|>,Op69dong-><\|\|>\|> This way we find ourseves with 4468 users. Here is the distribution of the amount of subreddits each user commented in. Most users only have significant activity (more than 20 comments) in one subreddit. Subreddit analysis It's time to start the Subreddit analysis. I first got the list of all subs (1812 of them are present in my data). allSubs = Merge[Values[commentPerSub], Total]; Length[allSubs] 1812 Here are the first five by number of comments TakeLargest[allSubs, 5] <\|"AskReddit" -> 85288, "nba" -> 48097, "nfl" -> 43888, "leagueoflegends" -> 27483, "hockey" -> 18535\|> We can see that AskReddit is very popular sub so user commenting on it are probably not very correlated by interestsTherefore I decided to drop it from the list. Now I can have a look at the plot of each subreddit vs its number of comments. User representation Now I have enough information about subreddits and users for creating user vectors which we will use in user's classification.So let's create an empty user vector with zero comments for each sub, then replace the values for each user into empty vector.The new association as values for each subreddit. (Create an empty user vector with zero comments for each sub) userVectorEmpty = Association[Thread[Keys[allSubs] -> 0]]; Short[userVectorEmpty] <\|newsokur->0,BakaNewsJP->0,nba->0,<<1806>>,shorthairedhotties->0,uwaterloo->0\|> (Replace values for each user into the empty vector) extendedCommentPerSub = Join[userVectorEmpty, #] & /@ Values[commentPerSub]; The new association as values for each subreddit First[commentPerSub] Short[First[extendedCommentPerSub]] <\|newsokur->222,BakaNewsJP->37\|> <\|newsokur->222,BakaNewsJP->37,nba->0,<<1806>>,shorthairedhotties->0,uwaterloo->0\|> (Create an empty user vector with zero comments for each sub and remove the \ subreddit names) userVectors = Values[Join[userVectorEmpty, #]] & /@ commentPerSub; (save the name of the users) userNames = Keys[userVectors]; (remove also the keys with the users, now userVectors is just a normal \ matrix) userVectors = Values[userVectors]; (normalize each vector user by dividing for the total number of comments per \ subreddit) userVectors = Transpose[Transpose[userVectors]/Values[allSubs]]; This is a plot of all the users vectors. User classification We want to build a similarity function to identify users with common interests similarity[v1_,v2_]:=v1.v2/(Norm[v1]Norm[v2]) The norm at the denominator are computed separately for efficiency norms=Norm/@N[userVectors]; normMatrix=Transpose[{norms}].{norms}; This builds the similarity matrix similarityMatrix=N[userVectors].Transpose[N[userVectors]]/normMatrix; I want all the values below a certain threshold to be zero. I.e. no connection between the users. I also subtract the diagonal not to connect users with themselves similarityThreshold=.75; m2=Threshold[similarityMatrix-IdentityMatrix[Length[userVectors]],similarityThreshold]; All the rest is set to one m3=Ceiling[m2]; Now the graph is build from the adjacency matrix AdjacencyGraph[m3] In this graph there are 1625 disconnected components graphComponents=ConnectedGraphComponents[fullGraph]; Length[graphComponents] 1625 If I remove the components with only one element (i. e. the cluster of only one user) I am down to 592 graphComponents=Select[graphComponents,VertexCount[#]>1&]; Length[graphComponents] 592 Each disconnected component represent now a group of users with similar commenting patterns---and hopefully similar interests. Here are five of them Grid[{#,commentPerSub[[VertexList[#]]]}&/@RandomSample[graphComponents,5]] As we can see all these group of users have high activity on the same Subreddits. Attachments: