How to keep DeleteStopwords from improperly breaking hyphenated words?

Hi, friends

Please consider this example:

TextWords["through the mechanism of follow-up of living patients 
the natural history of various diseases of military-medical 
importance"]
DeleteStopwords[%]

The first operation works as expected:

{"through", "the", "mechanism", "of", "follow-up", "of", "living", "patients", "the", "natural", "history", "of", "various", "diseases", "of", "military-medical", "importance"}

But the second one cuts off the ending of "follow-up":

{"mechanism", "follow-", "living", "patients", "natural", "history", "various", "diseases", "military-medical", "importance"}

Can this problem be fixed with Mathematica's standard text normalization tools?

POSTED BY: Konstantin Nosov
5 Replies
POSTED BY: Marco Thiel
Posted 7 years ago

That looks like a bug. I would report it to Wolfram Support. One workaround would be to remove trailing hyphens.

sentence = "through the mechanism of follow-up of living patients 
the natural history of various diseases of military-medical 
importance";
sentence // TextWords // DeleteStopwords // StringReplace["-" ~~ EndOfString -> ""]

(* {"mechanism", "follow", "living", "patients", "natural", "history", "various", "diseases", "military-medical", "importance"} *)
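
Note that this workaround drops the "-up" part. If that is not acceptable, one possible variant is to let hyphenated tokens bypass DeleteStopwords entirely. This is only a sketch: the helper name keepQ is made up for illustration, and it assumes that DeleteStopwords returns an empty list when given a one-element list holding a stop word.

(* keepQ is a hypothetical helper: keep hyphenated tokens as they are, *)
(* and keep any other token only if DeleteStopwords does not remove it *)
keepQ[w_String] := StringContainsQ[w, "-"] || DeleteStopwords[{w}] =!= {}

(* reuses the sentence defined above *)
Select[TextWords[sentence], keepQ]

Since each hyphenated token is never handed to DeleteStopwords, it cannot be truncated, at the cost of one DeleteStopwords call per remaining word.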
POSTED BY: Rohit Namjoshi
Posted 7 years ago

To get the list of built-in stop words:

WordList["Stopwords"]

or

WordData[All, "Stopwords"]
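
With that list in hand, the filtering could also be done by hand, comparing each token from TextWords as a whole so that hyphenated words are never split. A minimal sketch, assuming the list entries are lowercase and that this list matches what DeleteStopwords uses internally (neither is verified here):

sentence = "through the mechanism of follow-up of living patients the natural history of various diseases of military-medical importance";
stopwords = WordList["Stopwords"];

(* keep a token only if its lowercase form is not in the stop word list *)
Select[TextWords[sentence], ! MemberQ[stopwords, ToLowerCase[#]] &]

The result may differ slightly from DeleteStopwords if the two use different word lists.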
POSTED BY: Rohit Namjoshi
Posted 7 years ago

Hi Konstantin,

I completely agree, which is why I suggested reporting it to Wolfram Support as a bug.

POSTED BY: Rohit Namjoshi

Thanks, Rohit

But I think stop word removal should not split words. If "follow-up" is not a stop word, DeleteStopwords should leave it untouched.

POSTED BY: Konstantin Nosov