Word cloud for WeChat chat room log

Posted 9 years ago

This has been a problem for me for a while. Every day I notice hundreds or even thousands of unread messages piling up from a single WeChat chat room with hundreds of members, and it gets even worse when I am invited into many large group chat rooms. While it is simply impossible to follow all of the messages, I do not want to lose the context completely. I am still curious about the overall topics in each chat room. What are these acquaintances and friends interested in? Is there a trend? How do topics evolve in each chat room? Is a given chat room worth the effort of keeping a close eye on?

Finally I decided to take some action on this. I downloaded the WeChat log database from my rooted Android phone, deciphered it and extracted the desired data, and used the Chinese word segmentation tool Jieba in Python to segment the words (these chat logs are mostly Chinese, or a blend of Chinese and English). Then comes the more exciting visual display part, using WordCloud in the Wolfram Language (and yes, the Wolfram Language works with Chinese like a charm!).

Chinese text and Unicode symbols (in WeChat, people like to use special Unicode symbols: a basketball, a kiss, a cup, a flower, you name it) tend to create a mess in a CSV file, so I ended up generating two text files for the raw data. The first contains all messages for a chat room; each message is word-segmented with spaces, and the separator "^^^" divides messages from different talkers. The second contains the corresponding timestamps; each timestamp is formatted as "MMM DD YYYY", with the separator "^" between dates. For privacy reasons, I am just uploading two sample files here.
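Since the two-file format is so simple, parsing it back into (date, message) pairs is a one-liner in most languages. Here is a minimal Python sketch (the function name and sample strings are my own illustration, not part of the original pipeline):

```python
def parse_logs(messages_text, times_text):
    """Pair each '^^^'-separated message with its '^'-separated
    'MMM DD YYYY' timestamp (an illustrative sketch, not the original code)."""
    messages = messages_text.split("^^^")
    dates = times_text.split("^")
    return list(zip(dates, messages))

pairs = parse_logs("hello world^^^lucky money", "Jan 01 2016^Jan 02 2016")
```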

In the Wolfram Language, I created a stop-word list to clear out any word that is not meaningful enough to me in a word-cloud context. I selected the most frequent single characters appearing in the message files from different chat rooms and took the union of the lists. This turns out to be an effective way to delete single characters, including punctuation and special Unicode symbols.

Here is the code for this:

singles[filedir_] := Module[{words, stally},
   words = Import[filedir, "Text", CharacterEncoding -> "UTF8"];
   stally =
    Tally[Select[
      Flatten[StringSplit[#, " "] & /@ StringSplit[words, "^^^"]],
      Length[CharacterCounts[#]] == 1 &]];
   Take[SortBy[stally, Last], -200][[All, 1]]
   ];

stops[filedirlist_] := Module[{list},
  list = Flatten[singles[#] & /@ filedirlist];
  Union[list, manualList]
  ]

stopWords =
 stops[{"messages1.txt",
   "messages2.txt",
   "messages3.txt"}]
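For readers more comfortable with Python, the same idea, tallying single-character tokens per file, keeping the most frequent, and taking the union with a manual list, might be sketched roughly like this. Note that this is a loose analog, not the author's code: the Wolfram version keeps tokens made of one distinct character, while this sketch simply keeps length-one tokens.

```python
from collections import Counter

def singles(text, top_n=200):
    """Top-n most frequent single-character tokens in a '^^^'-separated,
    space-segmented message dump (loose analog of singles[] above)."""
    tokens = [t for msg in text.split("^^^") for t in msg.split(" ")]
    tally = Counter(t for t in tokens if len(t) == 1)
    return [tok for tok, _ in tally.most_common(top_n)]

def stops(texts, manual_list=()):
    """Union of per-file single-character lists with a manual list
    (loose analog of stops[] above)."""
    result = set(manual_list)
    for text in texts:
        result.update(singles(text))
    return sorted(result)
```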

Then I manually trimmed the list down and named the final list "stopList" (I want to keep many single-character verbs and nouns; for example, the verbs “send” and “grab” are important to keep in the context of sending and grabbing lucky money, a popular game in WeChat).

The Complement function did not work with Chinese characters for me, so instead I used a Do loop and DeleteCases to check the date-grouped messages against stopList. Here is the code to generate the manipulable word cloud. It lets me easily see what people are discussing during any specific time range in a specific chat room (note that you must force the character encoding to UTF8 when importing the files):

wechatCloudRange[mfiledir_, tfiledir_] :=
 Module[{messages, messages0, times, times0, mtimes, mtest, mtimes0, i, j, k},
  messages = Import[mfiledir, "Text", CharacterEncoding -> "UTF8"];
  messages0 = StringSplit[messages, "^^^"];
  times = Import[tfiledir, "Text", CharacterEncoding -> "UTF8"];
  times0 = StringSplit[times, "^"];
  mtimes = {};
  Do[AppendTo[mtimes, {times0[[i]], messages0[[i]]}], {i, Length[times0]}];
  mtimes0 = GroupBy[mtimes, First -> Last];
  Do[
   mtest = StringSplit[StringJoin[mtimes0[[j]], " "], " "];
   Do[mtest = DeleteCases[mtest, stopList[[k]]], {k, Length[stopList]}];
   mtimes0[Keys[mtimes0][[j]]] = SortBy[Tally[mtest], Last],
   {j, Length[mtimes0]}];
  mtimes0 = KeySortBy[mtimes0, AbsoluteTime[{#, {"Month", "Day", "Year"}}] &];
  Manipulate[
   WordCloud[
    Flatten[Values[
      mtimes0[[Position[Keys[mtimes0], from][[1, 1]] ;;
         Position[Keys[mtimes0], to][[1, 1]]]]], 1],
    ImageSize -> 600, ScalingFunctions -> "Log"],
   {from, Keys[mtimes0]}, {to, Keys[mtimes0]}, ContentSize -> {700, 600}]
  ]
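The data-shaping core of wechatCloudRange (group messages by date, drop stop words, tally the rest) can be sketched as a rough Python analog; the names and sample data below are illustrative only:

```python
from collections import Counter, defaultdict

def cloud_counts(pairs, stop_list):
    """Group (date, message) pairs by date and tally the non-stop-word
    tokens, mirroring the GroupBy / DeleteCases / Tally steps (rough analog)."""
    by_date = defaultdict(Counter)
    for date, message in pairs:
        tokens = [t for t in message.split(" ") if t and t not in stop_list]
        by_date[date].update(tokens)
    return dict(by_date)

counts = cloud_counts(
    [("Jan 01 2016", "lucky money money"), ("Jan 01 2016", "send money")],
    stop_list={"send"},
)
```

Checking stop words against a set here is O(1) per token, which sidesteps the nested Do loop in the Wolfram version.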

Here are screenshots of the final results for two chat rooms (real member names are masked):

chatroom 1

chatroom 2

You can try this out with the sample files by commenting out the stopList line or creating a stopList.

It is interesting to notice that while some chat rooms cover a wide range of topics with a rich vocabulary, others are basically "lucky money" rooms with far fewer words. In my case, the chat room largely focused on lucky money is that way because its members have long since drifted apart in education and professional life. There are fewer common subjects for them to talk about, and usually only a handful of people are super active while the majority just lurk (I also made a word cloud of the active members). The more diverse rooms are mainly domain- or industry-based: people from similar professional or college backgrounds share common interests even in subjects outside that domain. For example, I noticed that one person broadcast news about the shooting tragedy in Paris in a domain-focused group and created a buzz, while another person did the exact same thing in a "lucky money" room, where it soon faded into the background.

This also gives me a general feel for the topic style and language style of a specific chat room. If I would like to start a conversation in a chat room, I can always warm it up with something the group has historically been, or has recently been, interested in.

If you have any suggestions on interesting ways to display the data with the Wolfram Language, please let me know.

POSTED BY: Dan Lou
4 Replies
Posted 9 years ago

Thanks, Sander! I just tested it and it actually works. So I can simplify that line of code to:

mtest = Complement[mtest, stopList];

Probably I was testing the Complement function with the earlier version of the imported CSV data and got the wrong impression that it does not work; in fact it was the CSV file causing the problem.

POSTED BY: Dan Lou

Note that it might be slightly different, as Complement returns a sorted list with all the duplicates removed... Just check whether that is OK in your case; I see you do a Tally after it, so I'm not 100% sure Complement gives you the desired result.
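The same caveat can be shown with a rough Python analogy (not Wolfram Language, but the semantics are parallel): a set-style difference sorts and collapses duplicates, while an order-preserving filter keeps the multiplicities a later frequency tally needs.

```python
mtest = ["money", "send", "money", "grab"]
stop_list = ["send"]

# Set-style difference (like Complement): sorted, duplicates collapsed,
# so a later frequency tally would undercount.
set_diff = sorted(set(mtest) - set(stop_list))

# Order- and multiplicity-preserving filter (like the DeleteCases loop):
filtered = [t for t in mtest if t not in stop_list]
```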

POSTED BY: Sander Huisman
Posted 9 years ago

Yeah, that's an issue.

Or maybe I can try to tally first, then Complement, then try to get the matched ones from the tallied list.

POSTED BY: Dan Lou

Thanks for sharing!! Pretty neat. Why does Complement not work for you?

If I grab some Chinese characters from Wikipedia, it seems to work:

chars={"?","?","?","?","?","?","?","?","?","?","?","?","?","?","?","?","?","?","?","?","?","?"}
partchars=chars[[;; ;; 3]]
Complement[chars,partchars]
POSTED BY: Sander Huisman