Get the sources for Wolfram's stop words and common words?

Posted 8 years ago

Hello everyone,

I'm doing some text analysis and making use of Mathematica's DeleteStopwords and WordList["CommonWords"].

Where can I find Wolfram's source for these?

Regards,

Gregory

POSTED BY: Gregory Lypny
3 Replies

Dear Gregory,

I don't know the algorithm, but I suppose the list of stop words that the Wolfram Language uses can be recovered (at least for the words that are in the dictionary) like this:

list = Complement[DictionaryLookup["*"], DeleteStopwords[DictionaryLookup["*"]]]
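A possible cross-check, offered as a sketch: as far as I recall, WordList also accepts a "Stopwords" type, which would give the built-in list directly. Treat that type name as an assumption and check it against the documentation for your version. The names builtin, dict, and recovered are only illustrative.

```wolfram
(* Assumption: the "Stopwords" word-list type is available in your version *)
builtin = WordList["Stopwords"];

(* Reconstruction from the dictionary, as in the line above *)
dict = DictionaryLookup["*"];
recovered = Complement[dict, DeleteStopwords[dict]];

(* Words that appear in one list but not in the other *)
Complement[builtin, recovered]
Complement[recovered, builtin]
```

If both complements come back empty, the two lists agree; if not, the differences show which stop words the dictionary-based reconstruction misses or adds.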

I can then also build my own function to delete stop words:

Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[list, #] &]

It is about as efficient as the original:

Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[list, #] &]; // RepeatedTiming

(0.69 seconds)

DeleteStopwords[TextWords[ExampleData[{"Text", "AliceInWonderland"}]]]; // RepeatedTiming

(0.74 seconds)
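MemberQ scans the stop list once per word, so the cost grows with the length of the list. A minimal sketch of an alternative, assuming an Association is used as a hash set (stopSet, viaMemberQ, and viaLookup are hypothetical names); the timing is left for you to measure rather than quoted here:

```wolfram
dict = DictionaryLookup["*"];
list = Complement[dict, DeleteStopwords[dict]];
words = TextWords[ExampleData[{"Text", "AliceInWonderland"}]];

(* Association used as a set: every stop word becomes a key *)
stopSet = AssociationMap[True &, list];

viaMemberQ = Select[words, ! MemberQ[list, #] &];
viaLookup = Select[words, ! KeyExistsQ[stopSet, #] &];

(* Both filters should keep exactly the same words *)
viaMemberQ === viaLookup

(* Time the lookup-based version on your own machine *)
RepeatedTiming[Select[words, ! KeyExistsQ[stopSet, #] &];]
```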

Also, it does not give quite the same result:

Complement[DeleteStopwords[TextWords[ExampleData[{"Text", "AliceInWonderland"}]]], 
Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[list, #] &]]

and

Complement[Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[list, #] &],
DeleteStopwords[TextWords[ExampleData[{"Text", "AliceInWonderland"}]]]]

The last result suggests that a case-insensitive comparison is better:

Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[ToLowerCase /@ list, ToLowerCase[#]] &]

as you can easily see:

Complement[Select[TextWords[ExampleData[{"Text", "AliceInWonderland"}]], ! MemberQ[ToLowerCase /@ list, ToLowerCase[#]] &], 
DeleteStopwords[TextWords[ExampleData[{"Text", "AliceInWonderland"}]]]]

So the two are not quite the same, but only a few "words" differ. You will also notice that the last output contains many words that end in 's, which is easy to fix, too.

The little function that we built above can be fed with your favourite word list.
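As a sketch of what such a reusable function might look like, the following wraps the filter so that any word list can be passed in. It folds case and strips a trailing 's before comparing, which is one possible reading of the "easy to fix" remark above, not necessarily what DeleteStopwords does internally. The names normalize and deleteWords are hypothetical.

```wolfram
(* Hypothetical helper: lower-case a token and drop a trailing 's *)
normalize[w_String] := 
  StringReplace[ToLowerCase[w], ("'s" | "’s") ~~ EndOfString -> ""];

(* Hypothetical function: stop can be any list of strings *)
deleteWords[words_List, stop_List] := 
  With[{stopSet = AssociationMap[True &, normalize /@ stop]},
   Select[words, ! KeyExistsQ[stopSet, normalize[#]] &]];

deleteWords[{"The", "cat's", "hat", "is", "red"}, {"the", "cat", "is"}]
(* {"hat", "red"} *)
```

The original tokens are returned unchanged; normalization is applied only for the comparison.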

I know that this does not answer your question, but I hope it is useful.

Best wishes,

Marco

PS: I quite enjoy the fact that Mathematica appears to treat all in-laws as stop words.

POSTED BY: Marco Thiel
Posted 8 years ago

Hi Marco,

Much obliged. I'll definitely have to create a custom stop-word list. Mathematica's is too broad, and includes words that have potential positive or negative sentiment (e.g., "against") in the context of text analysis.
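One way to get there, sketched under the assumption that the default list is recovered from the dictionary as in Marco's reply: start from that list and take out the words worth keeping. The names defaultStop, keep, and customStop are illustrative, and the keep list holds only the one example mentioned above.

```wolfram
dict = DictionaryLookup["*"];
defaultStop = Complement[dict, DeleteStopwords[dict]];

(* Illustrative only: words to retain because they may carry sentiment *)
keep = {"against"};
customStop = Complement[defaultStop, keep];

(* Filter a sample sentence with the narrowed list *)
words = TextWords["She voted against the plan."];
Select[words, ! MemberQ[customStop, ToLowerCase[#]] &]
```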

Regards,

Gregory

POSTED BY: Gregory Lypny

Folks, please also take a look at

list = {"cat", "cow"};
StringDelete["cat dog cow horse", list]

which avoids splitting the string into words.
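One caveat worth checking: with plain strings, StringDelete removes the substring wherever it occurs, including inside longer words. A minimal sketch of a whole-word variant, assuming the WordBoundary string pattern suits your text:

```wolfram
list = {"cat", "cow"};

(* Plain substrings: this also removes the "cat" inside "concatenate" *)
StringDelete["cat concatenate cow", list]

(* Whole words only: require a word boundary on both sides *)
StringDelete["cat concatenate cow", 
 WordBoundary ~~ (Alternatives @@ list) ~~ WordBoundary]
```

Either way the surrounding spaces remain in the result, so a whitespace cleanup step may still be needed afterwards.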

POSTED BY: Vitaliy Kaurov