WOLFRAM COMMUNITY

Get the sources for Wolfram's stop words and common words?

Gregory Lypny

Posted 10 years ago

Hello everyone, I'm doing some text analysis and making use of Mathematica's DeleteStopwords and WordList["CommonWords"]. Where can I find Wolfram's sources for these? Regards, Gregory

3 Replies

Vitaliy Kaurov, WOLFRAM Research

Posted 10 years ago

Folks, please also take a look at

list = {"cat", "cow"}; StringDelete["cat dog cow horse", list]

which avoids splitting the string into words.
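
One thing to keep in mind with a plain list of strings is that StringDelete removes every matching substring, not only whole words. A minimal sketch, assuming whole-word matching is what is wanted; the sample sentence is made up for illustration:

```wolfram
list = {"cat", "cow"};

(* Plain strings match anywhere, so "cat" is also cut out of "category" *)
StringDelete["category cow", list]
(* "egory " *)

(* WordBoundary restricts each alternative to whole words *)
StringDelete["category cow",
 WordBoundary ~~ (Alternatives @@ list) ~~ WordBoundary]
(* "category " *)
```

Either way the surrounding spaces are left in place, so the result may need whitespace cleanup afterwards.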

Gregory Lypny

Posted 10 years ago

Hi Marco, Much obliged. I'll definitely have to create a custom stop-word list. Mathematica's is too broad and includes words that can carry positive or negative sentiment (e.g., "against") in the context of text analysis. Regards, Gregory

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 10 years ago

Dear Gregory,

I don't know the algorithms, but I suppose that the list of stop words the Wolfram Language (WL) uses is:

list = Complement[DictionaryLookup["*"], DeleteStopwords[DictionaryLookup["*"]]]

I can then also build my own function to delete stop words:

Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[list, #] &]

It is about as efficient as the original:

Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[list, #] &]; // RepeatedTiming

(0.69 seconds)

DeleteStopwords[TextWords[ExampleData[{"Text", "AliceInWonderland"}]]]; // RepeatedTiming

(0.74 seconds)

Also, it does not give quite the same result:

Complement[DeleteStopwords[TextWords[ExampleData[{"Text", "AliceInWonderland"}]]], 
Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[list, #] &]]

and

Complement[Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[list, #] &],
DeleteStopwords[TextWords[ExampleData[{"Text", "AliceInWonderland"}]]]]

The last result suggests that this is better:

Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[ToLowerCase /@ list, ToLowerCase[#]] &]

as you can easily see:

Complement[Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[ToLowerCase /@ list, ToLowerCase[#]] &], 
DeleteStopwords[TextWords[ExampleData[{"Text", "AliceInWonderland"}]]]]

So they are not quite the same, but the "words" that differ are few. You will also notice that the last output contains many words that end in 's, which is easy to fix, too.
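
One possible way to handle the trailing 's is to strip it before comparing against the stop-word list. This is only a sketch: stripPossessive is a hypothetical helper name, and it assumes that both the straight (') and the typographic (’) apostrophe may occur in the text.

```wolfram
(* Hypothetical helper: drop a possessive 's at the end of a word *)
stripPossessive[word_String] :=
 StringReplace[word, ("'s" | "’s") ~~ EndOfString -> ""]

stripPossessive /@ {"Alice's", "Rabbit’s", "cats"}
(* {"Alice", "Rabbit", "cats"} *)
```

Mapping this over the output of TextWords before filtering would let "Alice's" be compared as "Alice".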

The little function that we built above can be fed with your favourite word list.
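
The Select expression above can be wrapped into a reusable function along these lines. This is a minimal sketch: deleteWords is a hypothetical name, and the sample words are made up for illustration. It compares in lower case, as in the improved version above, but lower-cases the stop list only once.

```wolfram
(* Hypothetical helper: remove every word found in stop, ignoring case *)
deleteWords[words_List, stop_List] :=
 With[{lower = ToLowerCase[stop]},
  Select[words, ! MemberQ[lower, ToLowerCase[#]] &]]

deleteWords[{"The", "cat", "sat", "against", "the", "wall"},
 {"the", "against"}]
(* {"cat", "sat", "wall"} *)
```

A custom list can start from the reconstructed list and drop the words that should be kept, for example DeleteCases[list, "against"].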

I know that this does not answer your question, but I hope it is useful.

Best wishes,

Marco

PS: I quite enjoy the fact that Mathematica appears to treat all in-laws as stop words.
