Grammar reduction and other cleanup in WordCloud

If you are building a word cloud, a few tricks help. First of all, use DeleteStopwords to drop very common words such as "the" and "of". Compare:

cat = Import["https://en.wikipedia.org/wiki/Cat"];

{WordCloud[cat], WordCloud[DeleteStopwords[cat]]}

[Image: two word clouds of the Cat article, without (left) and with (right) DeleteStopwords]

Yes, the right one is much better, but now we see the next problem: "cat" and "cats" are counted as separate words. I usually reduce each word to its base form:

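Clear@base;
(* Return the first base form that WordData reports for w, or w itself if none is found *)
base[w_] := With[
   {tmp = WordData[w, "BaseForm", "List"]},
   If[(Head[tmp] === Missing) || tmp === {}, w, tmp[[1]]]];
(* Listable lets base be applied directly to a list of words *)
SetAttributes[base, Listable];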

I would also consider removing numbers (not a must, though) and blacklisting some words. For example, Wikipedia pages are often noisy with citation tokens such as:

blackLIST = {"doi", "ed", "isbn", "pmid"};

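I would also use ScalingFunctions -> (#^s &), where the exponent $s$ is a correction that emphasizes or deemphasizes how strongly word frequency translates into visual size. I keep it in the range $0<s<5$ (5 is a good upper limit, though it could be more); the example below uses $s=0.3$. Here we go: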

WordCloud[
 DeleteCases[
  base[TextWords[
    StringDelete[DeleteStopwords[ToLowerCase[cat]], DigitCharacter ..]]],
  Alternatives @@ blackLIST],
 ScalingFunctions -> (#^.3 &)]

[Image: the resulting word cloud after base-form reduction, number removal, blacklisting, and rescaling]
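To get a feel for the exponent, it can help to render the same words with several values side by side. The following is only a sketch: the three exponents are arbitrary picks for illustration, `clouds` is a name introduced here, and the base-form and blacklist steps are left out to keep it short.

```wolfram
(* Same page as in the post *)
cat = Import["https://en.wikipedia.org/wiki/Cat"];
words = TextWords[DeleteStopwords[ToLowerCase[cat]]];

(* One cloud per exponent; the values 0.3, 1 and 3 are illustrative only *)
clouds = Function[s, WordCloud[words, ScalingFunctions -> (#^s &)]] /@ {0.3, 1, 3}
```

Exponents below 1 compress the size differences between frequent and rare words, and exponents above 1 exaggerate them.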

I imagine there is more to it, such as statistically finding less meaningful words and removing them, and so on.
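One possible reading of "statistically less meaningful" is a word that is no more common on this page than in English in general. As a rough sketch under that assumption (not a tested recipe), each word can be weighted by the ratio of its frequency on the page to its typical frequency as reported by WordFrequencyData. The cutoff of 200 words and the names `counts`, `local`, `reference`, `known`, and `scores` are made up for this example, and the lookup needs network access.

```wolfram
(* Same page as in the post *)
cat = Import["https://en.wikipedia.org/wiki/Cat"];
words = TextWords[DeleteStopwords[ToLowerCase[cat]]];

(* Keep at most the 200 most frequent words to limit the reference lookup *)
counts = Take[Sort[Counts[words], Greater], UpTo[200]];

(* Relative frequency of each word among the kept words *)
local = N[counts/Total[counts]];

(* Typical English frequency; entries may be Missing[...] for unknown words *)
reference = WordFrequencyData[Keys[counts]];

(* Drop words without a numeric reference frequency *)
known = Select[reference, NumericQ];

(* Score: how over-represented a word is here compared with typical English *)
scores = Association[KeyValueMap[#1 -> local[#1]/#2 &, known]];

WordCloud[scores, ScalingFunctions -> (#^.3 &)]
```

Words with a low score are about as common here as anywhere else, so they could also be removed outright with a threshold instead of merely being drawn small.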

Do you have your own tips? Please, share!

POSTED BY: Vitaliy Kaurov
2 Replies
POSTED BY: Bianca Eifert

Thanks for the feedback, @Bianca Eifert! Indeed, there are cases when the filtering needs to be more sophisticated. We will take a look at what causes WordData["species","BaseForm"].

POSTED BY: Vitaliy Kaurov