
Language Setting for String Functions (DeleteStopwords in German)?

Posted 8 years ago

I am working with German-language strings, but the linguistic functions all seem to support only English. For example,

WordData[All, "Noun"] 

returns only English nouns, and DeleteStopwords[] likewise works only for English. My goal is to build a WordCloud from German texts or websites, either with the stop words removed or restricted to nouns and verbs. Is there a way to do this with German strings?

POSTED BY: Daniel Wenczel
2 Replies
Posted 8 years ago

Thanks a lot, very helpful!

POSTED BY: Daniel Wenczel

Hi,

You are quite right that some of the linguistic functions are English-only. The OCR has just been extended to support other languages as well, and some other functions can easily be implemented for other languages yourself. Take the DeleteStopwords function, for example.

If you take your favourite list of German stop words, you can write your own little function. First I import some German stop words from a website (I am not saying that this is a suitable list for you) and extract the words from the page:

stopwords = Flatten[TextWords[Import["http://www.ranks.nl/stopwords/german", "Data"][[2, 2]]]]

Then this little function will do the trick:

stopwordsGerman[text_String] := Select[TextWords[text], ! MemberQ[ToLowerCase /@ stopwords, ToLowerCase[#]] &]
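
A slightly more self-contained variant of the same idea is sketched below. It takes the stop-word list as an argument and lowercases it once per call instead of once per word. The names deleteStopwordsWith and demoStopwords are hypothetical, and the three-word list is only there so the example runs without the web import:

```wolfram
(* Sketch only: deleteStopwordsWith and demoStopwords are made-up names,
   and the tiny stop-word list is purely illustrative. *)
deleteStopwordsWith[text_String, stopList_List] :=
 With[{lower = ToLowerCase /@ stopList},
  (* keep every word whose lowercase form is not in the stop-word list *)
  Select[TextWords[text], ! MemberQ[lower, ToLowerCase[#]] &]]

demoStopwords = {"der", "die", "und"};

deleteStopwordsWith["Der Hund und die Katze", demoStopwords]
```

With this toy list the call should return {"Hund", "Katze"}. Passing the imported stopwords list as the second argument gives the same behaviour as stopwordsGerman.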

You can now feed it any text you like, such as the German version of the UN human rights declaration that ships with ExampleData:

stopwordsGerman[ExampleData[{"Text", "UNHumanRightsGerman"}]]

and get a list of words without the stop words. You can then generate a word cloud from that:

WordCloud[stopwordsGerman[ExampleData[{"Text", "UNHumanRightsGerman"}]], IgnoreCase -> True]

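
If you want to see which words dominate the cloud (for instance to spot stop words that slipped through), one option is to count the words yourself and hand the counts to WordCloud. The following is a minimal sketch; the short words list is a hypothetical stand-in for the output of stopwordsGerman:

```wolfram
(* Hypothetical word list standing in for the output of stopwordsGerman. *)
words = {"Recht", "recht", "Freiheit", "Recht", "Würde"};

(* Count case-insensitively by lowercasing every word first. *)
counts = Counts[ToLowerCase /@ words];

(* Most frequent words first; keep at most 50 of them. *)
top = Take[Sort[counts, Greater], UpTo[50]]

(* WordCloud also accepts an association of word -> weight. *)
WordCloud[top]
```

Looking at top before plotting makes it easy to decide which further words to add to your stop-word list.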

If you use a better list of stop words, this should give the kind of result you would expect. It does not, however, select for types of words such as verbs. For that you would need a word list that carries further information, such as the part of speech of each word.
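
To illustrate what such a filter could look like once you have that information, here is a minimal sketch. The lexicon association and the function name selectWordTypes are hypothetical; a real application would have to load a proper German word list with part-of-speech tags from a resource of your choice:

```wolfram
(* Hypothetical mini-lexicon mapping lowercase words to a part of speech.
   It is made up for illustration and is not part of any built-in data. *)
lexicon = <|"hund" -> "Noun", "katze" -> "Noun",
   "läuft" -> "Verb", "schnell" -> "Adverb"|>;

(* Keep only words whose lexicon entry is one of the requested types;
   words missing from the lexicon are dropped. *)
selectWordTypes[words_List, types_List] :=
 Select[words, MemberQ[types, Lookup[lexicon, ToLowerCase[#], None]] &]

selectWordTypes[{"Der", "Hund", "läuft", "schnell"}, {"Noun", "Verb"}]
```

With this toy lexicon the call should return {"Hund", "läuft"}. The result can be passed to WordCloud in the same way as the output of stopwordsGerman.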

Cheers,

Marco

POSTED BY: Marco Thiel