WOLFRAM COMMUNITY

How to keep DeleteStopwords from improperly breaking hyphenated words?

Konstantin Nosov, V. N. Karazin Kharkov National University

Posted 5 years ago

Hi, friends. Please consider this example:

TextWords["through the mechanism of follow-up of living patients the natural history of various diseases of military-medical importance"]
DeleteStopwords[%]

The first operation works as expected:

{"through", "the", "mechanism", "of", "follow-up", "of", "living", "patients", "the", "natural", "history", "of", "various", "diseases", "of", "military-medical", "importance"}

But the second one cuts off the ending of "follow-up":

{"mechanism", "follow-", "living", "patients", "natural", "history", "various", "diseases", "military-medical", "importance"}

Can this problem be fixed with the standard text-normalization tools in Mathematica?

Rohit Namjoshi

Posted 5 years ago

To get the list of built-in stop words, use

WordList["Stopwords"]

or

WordData[All, "Stopwords"]
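One way to see why "follow-up" is affected is to check which pieces of the word appear in the built-in list. The following is a minimal sketch, assuming WordList["Stopwords"] reflects the list that DeleteStopwords uses (the thread does not confirm this); stopwordStatus is a hypothetical helper name, not a built-in function.

```wolfram
(* Built-in stop word list, as suggested above *)
stopWords = WordList["Stopwords"];

(* Hypothetical helper: map each token to True/False depending on
   whether its lowercase form is in the stop word list *)
stopwordStatus[tokens_List] :=
  AssociationMap[MemberQ[stopWords, ToLowerCase[#]] &, tokens]

(* Compare the whole hyphenated word with its two halves *)
stopwordStatus[{"follow-up", "follow", "up"}]
```

If "up" is reported as a stop word while "follow-up" is not, that would be consistent with DeleteStopwords removing only the part after the hyphen.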

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 5 years ago

There might be a workaround. First of all, we can determine the list of stop words in the Wolfram Language:

stopWords = ToLowerCase@Complement[DictionaryLookup["*"], DeleteStopwords[DictionaryLookup["*"]]];

My first attempt at removing them from a word list was to use Complement, but it reorders the list of words alphabetically. This, however, does appear to work:

myDeleteStopwords[list_] := Select[list, ! MemberQ[stopWords, ToLowerCase[#]] &]

This might be less efficient than the built-in function. Let's try it on AliceInWonderland:

text = TextWords@ExampleData[{"Text", "AliceInWonderland"}];
myDeleteStopwords[text]; // AbsoluteTiming
DeleteStopwords[text]; // AbsoluteTiming

On my machine, myDeleteStopwords takes 0.081037 seconds to run and DeleteStopwords takes 0.087887 seconds, so they are practically equally fast. You might also want to look at the stop words that are built into the Wolfram Language. myDeleteStopwords might give you more flexibility if you want to choose different lists.

Cheers,
Marco
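To illustrate the point about choosing different lists, here is a hedged sketch of a variant that takes the stop word list as an argument and compares whole tokens only, so hyphenated words are never split. The name deleteWholeStopwords and the Association-based lookup are assumptions made for this example; they are not part of the code posted above.

```wolfram
(* Hypothetical variant of myDeleteStopwords: the stop word list is an argument *)
deleteWholeStopwords[words_List, stopWords_List] :=
  Module[{lookup},
    (* An Association gives fast membership tests on the lowercased list *)
    lookup = Association[Thread[ToLowerCase[stopWords] -> True]];
    (* Keep a token unless the whole token, lowercased, is a stop word *)
    Select[words, ! KeyExistsQ[lookup, ToLowerCase[#]] &]
  ]

(* Small example with an explicit list; "up" is listed, but "follow-up" survives *)
deleteWholeStopwords[{"The", "mechanism", "of", "follow-up"}, {"the", "of", "up"}]
(* {"mechanism", "follow-up"} *)
```

The same function could be called with WordList["Stopwords"] as the second argument to use the built-in list on the output of TextWords.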

Rohit Namjoshi

Posted 5 years ago

Hi Konstantin, I completely agree, which is why I suggested reporting it to Wolfram Support as a bug.

Konstantin Nosov, V. N. Karazin Kharkov National University

Posted 5 years ago

Thanks, Rohit. But I think that removing stop words should not split other words. If "follow-up" is not a stop word, DeleteStopwords should leave it untouched.

Rohit Namjoshi

Posted 5 years ago

That looks like a bug. I would report it to Wolfram Support. One workaround would be to remove trailing hyphens.

sentence = "through the mechanism of follow-up of living patients the natural history of various diseases of military-medical importance";
sentence // TextWords // DeleteStopwords // StringReplace["-" ~~ EndOfString -> ""]

(* {"mechanism", "follow", "living", "patients", "natural", "history", "various", "diseases", "military-medical", "importance"} *)
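The reason this workaround leaves "military-medical" alone is that the pattern is anchored to the end of each string, so only a trailing hyphen is removed. A small sketch on a hand-written list (the input list here is made up for illustration) shows the effect in isolation:

```wolfram
(* Only a hyphen at the very end of a string matches "-" ~~ EndOfString *)
StringReplace[{"follow-", "military-medical"}, "-" ~~ EndOfString -> ""]
(* {"follow", "military-medical"} *)
```

Note that this cleans up the dangling hyphen but does not restore the original word: the result is "follow", not "follow-up".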
