Whitelist custom phrases in WordCloud

Imagine you want a WordCloud but need to keep phrases such as "New York", "Rio de Janeiro", or "tongue-in-cheek" intact. Let's define a simple function:

keep[text_, list_] := StringCases[text, list]~Join~TextWords[DeleteStopwords[StringDelete[text, list]]]

It matters where DeleteStopwords goes in the code: apply it only after the whitelisted phrases have been extracted, so that stopwords inside those phrases are not lost. Now let's check the effect on the Wikipedia article about the famous statue Christ the Redeemer. Our whitelist is obvious:

list = {"rio de janeiro", "christ the redeemer"};
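Before running this on the whole article, a quick check on a short string can illustrate why the order matters. This is only a sketch: `sample` and `phrases` are made-up names and the sentence is invented, and it assumes that "the" is in the built-in stopword list.

```wolfram
(* Hypothetical sample sentence, not taken from the article *)
sample = "the statue of christ the redeemer overlooks rio de janeiro";
phrases = {"rio de janeiro", "christ the redeemer"};

(* Stopwords removed first: "the" is expected to drop out,
   so the phrase that contains it should no longer be found *)
StringCases[DeleteStopwords[sample], phrases]

(* Phrases extracted from the untouched text, as keep does:
   both phrases are found, in order of appearance *)
StringCases[sample, phrases]
```

Comparing the two results should show the phrase containing a stopword surviving only in the second case.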

Import text:

text = ToLowerCase[Import["https://en.wikipedia.org/wiki/Christ_the_Redeemer_(statue)"]];

and see the result compared to the one without whitelisting:

WordCloud[keep[text, list], WordOrientation -> {{-Pi/4, Pi/4}}]
WordCloud[DeleteStopwords[text], WordOrientation -> {{-Pi/4, Pi/4}}]

[image: word clouds with and without whitelisting]

POSTED BY: Vitaliy Kaurov
3 Replies

That's nifty, thanks for posting! Of course now you need to automatically analyse which words often occur in direct succession in this text (but maybe not so much in language in general, statistically speaking). That would give a really comprehensive whitelist.

POSTED BY: Bianca Eifert

You are right, this would be a non-trivial but interesting task, perhaps some fusion of computational linguistics and machine learning. The Wolfram Language already has tools that we could base this on: for instance, proper names and other special knowledge known to Entity, recognition of free-form input, and Interpreter. If our WordCloud concerns countries, for example, we could simply whitelist

EntityValue["Country", "Name"]
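A possible way to plug this into the whitelist, as a rough sketch only. Two things here are assumptions rather than part of the post above: the names are lowercased to match the lowercased `text`, and each name is wrapped in word boundaries so that a short name such as "oman" does not match inside "roman". `countries` and `countryPatterns` are made-up names, and the test sentence is invented. The call needs access to the entity data.

```wolfram
(* Lowercase the names, since the article text was lowercased on import *)
countries = ToLowerCase[EntityValue["Country", "Name"]];

(* Require word boundaries around each name to avoid matches inside longer words *)
countryPatterns = (WordBoundary ~~ # ~~ WordBoundary) & /@ countries;

(* Quick check on an invented sentence *)
StringCases["the roman statue stands in brazil, not in oman", countryPatterns]

(* With keep and text defined as earlier in the post: *)
(* WordCloud[keep[text, countryPatterns]] *)
```

Since StringCases and StringDelete both accept a list of string patterns, `keep` should work with these patterns unchanged.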

Other examples:

[image: further examples]

POSTED BY: Vitaliy Kaurov

That's a good idea. Names of places, people, notable objects and landmarks might all be recognisable enough, as well as numbers and their units, or dates and such. But in the general case, it's probably quite a challenge.

Actually, now that I think about it, maybe we don't have to make it so complicated. If words occur together a lot, they probably make sense together for a word cloud - even if they always stick together regardless of the specific text. Isn't that more or less the same algorithm used for the WL suggestion bar?
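A minimal sketch of that simpler idea: count pairs of adjacent words and take the most frequent ones as whitelist candidates. `frequentPairs` is a hypothetical helper, not a built-in, and dropping pairs that contain a stopword is an extra assumption so that pairs like "of the" do not dominate. It only looks at two-word windows and ignores sentence boundaries, so longer phrases such as "christ the redeemer" would not be found this way.

```wolfram
(* frequentPairs is a hypothetical helper name, not a built-in function *)
frequentPairs[text_String, n_Integer] := Module[{pairs},
  (* all pairs of adjacent words *)
  pairs = Partition[TextWords[text], 2, 1];
  (* drop pairs in which either word is a stopword *)
  pairs = Select[pairs, Length[DeleteStopwords[#]] == 2 &];
  (* the n most frequent pairs, joined back into two-word phrases *)
  Keys[TakeLargest[Counts[StringRiffle /@ pairs], UpTo[n]]]
]
```

With `keep` and `text` from the original post, something like `WordCloud[keep[text, frequentPairs[text, 20]]]` could then be tried; the cutoff of 20 is arbitrary.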

POSTED BY: Bianca Eifert