How to keep DeleteStopwords from improperly breaking hyphenated words?

Hi, friends

Please consider this example:

TextWords["through the mechanism of follow-up of living patients 
the natural history of various diseases of military-medical 
importance"]
DeleteStopwords[%]

The first operation works as expected:

{"through", "the", "mechanism", "of", "follow-up", "of", "living", "patients", "the", "natural", "history", "of", "various", "diseases", "of", "military-medical", "importance"}

But the second cuts off the end of "follow-up":

{"mechanism", "follow-", "living", "patients", "natural", "history", "various", "diseases", "military-medical", "importance"}

Can this problem be fixed with Mathematica's standard text-normalization tools?

POSTED BY: Konstantin Nosov
5 Replies
Posted 5 years ago

To get the list of built-in stop words:

WordList["Stopwords"]

or

WordData[All, "Stopwords"]
POSTED BY: Rohit Namjoshi

There might be a workaround.

First of all we can determine the list of stop words in the Wolfram Language:

stopWords = ToLowerCase@Complement[DictionaryLookup["*"], DeleteStopwords[DictionaryLookup["*"]]]; 

My first attempt was to use Complement, but it reorders the list of words alphabetically. This, however, does appear to work:

myDeleteStopwords[list_] := Select[list, ! MemberQ[stopWords, ToLowerCase[#]] &]

This might be less efficient than the built-in function. Let's try this on AliceInWonderland:

text = TextWords@ExampleData[{"Text", "AliceInWonderland"}];

myDeleteStopwords[text]; // AbsoluteTiming

DeleteStopwords[text]; // AbsoluteTiming

On my machine, myDeleteStopwords takes 0.081037 seconds to run and DeleteStopwords takes 0.087887 seconds, so they are practically equally fast.

You might also want to look at the stop words that are built into the Wolfram Language. myDeleteStopwords might give you more flexibility if you want to choose different lists.
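
As a rough sketch of the whole-word approach applied to the sentence from the question: the version below takes the stop-word list from WordList["Stopwords"], as suggested elsewhere in this thread, instead of deriving it from DictionaryLookup. The name wholeWordDeleteStopwords is made up for this illustration. Because each token is compared as a whole, a hyphenated token can only be dropped if the entire token appears in the list, so it is never cut in the middle.

```wolfram
(* Assumption: WordList["Stopwords"] returns the built-in stop-word list,
   as mentioned in another reply in this thread. *)
stopWords = ToLowerCase[WordList["Stopwords"]];

(* Hypothetical helper: keep a token unless the whole token,
   lowercased, is in the stop-word list. *)
wholeWordDeleteStopwords[words_List] :=
  Select[words, ! MemberQ[stopWords, ToLowerCase[#]] &];

sentence = "through the mechanism of follow-up of living patients the natural history of various diseases of military-medical importance";

(* Tokenize first, then filter token by token. *)
wholeWordDeleteStopwords[TextWords[sentence]]
```

Whether the result matches DeleteStopwords on every other token depends on how closely the two stop-word lists agree, which I have not checked.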

Cheers,

Marco

POSTED BY: Marco Thiel
Posted 5 years ago

Hi Konstantin,

I completely agree, which is why I suggested reporting it to Wolfram Support as a bug.

POSTED BY: Rohit Namjoshi

Thanks, Rohit

But I think that removing stop words should not break words apart. If "follow-up" is not a stop word, DeleteStopwords should leave it untouched.

POSTED BY: Konstantin Nosov
Posted 5 years ago

That looks like a bug. I would report it to Wolfram Support. One workaround would be to remove trailing hyphens.

sentence = "through the mechanism of follow-up of living patients 
the natural history of various diseases of military-medical 
importance";
sentence // TextWords // DeleteStopwords // StringReplace["-" ~~ EndOfString -> ""]

(* {"mechanism", "follow", "living", "patients", "natural", "history", "various", "diseases", "military-medical", "importance"} *)
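
Stripping the trailing hyphen still loses the "up" part. Another possible workaround, only a sketch, is to keep hyphenated tokens away from DeleteStopwords altogether and filter the remaining tokens with it. The name hyphenSafeDeleteStopwords is made up here, and the sketch assumes that DeleteStopwords returns non-hyphenated tokens unchanged, as it does in the output above.

```wolfram
(* Hypothetical helper: hyphenated tokens bypass DeleteStopwords entirely. *)
hyphenSafeDeleteStopwords[words_List] :=
  Module[{plain, kept},
    (* Tokens without a hyphen are the only ones passed to DeleteStopwords. *)
    plain = Select[words, StringFreeQ[#, "-"] &];
    kept = DeleteStopwords[plain];
    (* Preserve the original order: keep every hyphenated token,
       plus each plain token that survived DeleteStopwords. *)
    Select[words, ! StringFreeQ[#, "-"] || MemberQ[kept, #] &]
  ];

sentence // TextWords // hyphenSafeDeleteStopwords
```

By construction, "follow-up" and "military-medical" pass through untouched. The trade-off is that a hyphenated token that really is a stop word would also be kept.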
POSTED BY: Rohit Namjoshi