
Grammar reduction and other cleanup in WordCloud

If you are building a word cloud, a few tricks help. First of all, use DeleteStopwords. Compare:

cat = Import["https://en.wikipedia.org/wiki/Cat"];

{WordCloud[cat], WordCloud[DeleteStopwords[cat]]}

[Image: the two word clouds side by side, without DeleteStopwords (left) and with it (right)]

Yes, the right one is much better, but now we see the next problem: "cat" and "cats" are counted as separate words. I usually reduce words to their base form:

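(* Return the first base form WordData lists for w, or w itself if there is none *)
Clear@base;
base[w_] := With[
   {tmp = WordData[w, "BaseForm", "List"]},
   If[(Head[tmp] === Missing) || tmp === {}, w, tmp[[1]]]];
(* Listable lets base thread over a list of words *)
SetAttributes[base, Listable];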
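As a quick check, base can be tried on a short list before running it on a whole page. This is a minimal sketch, assuming the WordData data is installed or can be downloaded; the sample words are made up for illustration, and the results depend on what WordData knows about each word:

```wolfram
(* base as defined above *)
Clear@base;
base[w_] := With[
   {tmp = WordData[w, "BaseForm", "List"]},
   If[(Head[tmp] === Missing) || tmp === {}, w, tmp[[1]]]];
SetAttributes[base, Listable];

(* Hypothetical sample words *)
sample = {"cats", "kittens", "cat"};

(* Listable threads base over the list: one result per input word *)
reduced = base[sample]
```

Because base falls back to the word itself when no base form is found, the output list always has the same length as the input.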

I would also consider removing numbers (not a must though) and black-listing. For example, Wikipedia pages are often noisy with citation tokens such as:

blackLIST = {"doi", "ed", "isbn", "pmid"};
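Both cleanup steps are easy to try on a toy input first. A minimal sketch, where the sample string and the sample word list are made up for illustration:

```wolfram
blackLIST = {"doi", "ed", "isbn", "pmid"};

(* Strip every run of digits from a string; the surrounding spaces stay *)
noDigits = StringDelete["cat 2019 felis 42", DigitCharacter ..]
(* "cat  felis " *)

(* Drop black-listed words from a list of words *)
words = {"cat", "doi", "felis", "isbn", "ed"};   (* hypothetical sample *)
cleaned = DeleteCases[words, Alternatives @@ blackLIST]
(* {"cat", "felis"} *)
```

Deleting digits at the string level, before splitting into words, means tokens made only of digits disappear altogether instead of lingering as empty strings.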

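I would also use ScalingFunctions -> (#^s &), where $0<s<5$ is an exponent that emphasizes or de-emphasizes the visual link between word frequency and word size: values below 1 flatten the size differences, values above 1 exaggerate them (5 is a good upper number, could be more though). Here we go, with s = 0.3: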

WordCloud[
 DeleteCases[
  base[TextWords[StringDelete[DeleteStopwords[ToLowerCase[cat]], 
  DigitCharacter ..]]],
  Alternatives @@ blackLIST],
 ScalingFunctions -> (#^.3 &)]

[Image: the resulting word cloud after all cleanup steps]
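To get a feel for the exponent, it can help to render the same cleaned word list with several values of s side by side. This is only a sketch: the exponent values 0.3, 1 and 3 are arbitrary picks, the base-form step is left out to keep it short, and it assumes the page imports as plain text as above:

```wolfram
cat = Import["https://en.wikipedia.org/wiki/Cat"];
blackLIST = {"doi", "ed", "isbn", "pmid"};

(* Same cleanup as above, minus the base-form reduction *)
words = DeleteCases[
   TextWords[
    StringDelete[DeleteStopwords[ToLowerCase[cat]], DigitCharacter ..]],
   Alternatives @@ blackLIST];

(* One labeled cloud per exponent, for visual comparison *)
clouds = Table[
   Labeled[
    WordCloud[words, ScalingFunctions -> (#^s &)],
    Row[{"s = ", s}]],
   {s, {0.3, 1, 3}}]
```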

I imagine there is more to it, such as statistically finding less meaningful words and removing them, and so on.
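One very simple statistical filter, offered as a rough sketch of the idea and not as a tuned method, is to count the words and drop those that appear fewer than some minimum number of times. The threshold minCount and the toy word list are assumptions for illustration; WordCloud accepts an association of word -> weight, so the filtered counts can be passed in directly:

```wolfram
(* Hypothetical toy input; in practice this would be the cleaned word list *)
words = {"cat", "cat", "dog", "felis", "cat", "dog"};
minCount = 2;   (* assumed threshold, tune per document *)

(* Count occurrences, then keep only words seen at least minCount times *)
counts = Counts[words];
frequent = Select[counts, # >= minCount &]
(* <|"cat" -> 3, "dog" -> 2|> *)

(* The word -> count association serves as the weights *)
WordCloud[frequent]
```

A raw count threshold only removes rare words. Spotting words that are frequent but uninformative would need something more, for example comparing against frequencies in a reference corpus.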

Do you have your own tips? Please, share!

POSTED BY: Vitaliy Kaurov
2 Replies

Thanks for sharing your tips, Vitaliy! These "cleaning methods" seem to work very nicely on the whole. I have nothing new to add (haven't played with WordCloud yet), just wanted to say that there's obviously an error in WordData["species","BaseForm"]. Yes, "specie" is a word, but it has to do with coins and other forms of commodity money (i.e. no cats involved as far as humans know). WordData doesn't even return "species" as a baseform of "species", I looked at the complete results. If we did get both baseforms though, we'd need a good technique to decide which one is more likely to be relevant to the topic, but that might actually be doable... (sorry to be a downer, the "specie" error is actually quite funny...)

POSTED BY: Bianca Eifert

Thanks for the feedback, @Bianca Eifert! Indeed, there are cases when filtering needs to get more sophisticated. We will take a look at what causes the WordData["species","BaseForm"] result.

POSTED BY: Vitaliy Kaurov