Determine the language of a text using letter frequency

Posted 1 year ago

For a university course, we have to determine the language of a text using letter frequencies for three languages: Dutch, English and German.
We did this using the Levenshtein distance function (EditDistance[]). This gave us the following code

where we compare two strings. We chose six letters whose frequencies we compare. "ModelN", "ModelE" and "ModelD" are strings that contain these letters in the order they should theoretically have, sorted by frequency, if the text were in that language. "gesorteerd" is a string that contains the same letters sorted by their actual frequency in the text, from highest to lowest. The If statements tell us which language it is by looking for the smallest Levenshtein distance between the actual string ("gesorteerd") and the theoretical strings ("ModelN", "ModelE", "ModelD").
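The comparison described above can be sketched roughly as follows. This is a minimal Python illustration, not the original Wolfram Language code: the function names, the choice of six letters, and the model orderings are all assumptions made for the example, and the orderings are placeholders rather than measured frequency data. The hand-written levenshtein function stands in for EditDistance[].

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_letters(text: str, letters: str) -> str:
    """Return `letters` sorted by how often each occurs in `text`,
    most frequent first; ties are broken alphabetically."""
    counts = Counter(ch for ch in text.lower() if ch in letters)
    return "".join(sorted(letters, key=lambda ch: (-counts[ch], ch)))


def guess_model(text: str, models: dict) -> str:
    """Return the name of the model string closest to the text's ranking."""
    letters = next(iter(models.values()))  # all models use the same letters
    gesorteerd = rank_letters(text, letters)
    distances = {name: levenshtein(gesorteerd, model)
                 for name, model in models.items()}
    return min(distances, key=distances.get)


# Placeholder orderings for illustration only (assumption, not real data);
# replace them with orderings taken from an actual frequency table.
MODELS = {
    "ModelN": "enatir",
    "ModelE": "etainr",
    "ModelD": "enirat",
}
```

With these placeholder models, a call such as guess_model(some_text, MODELS) returns whichever of "ModelN", "ModelE" or "ModelD" has the smallest distance to the ranking computed from some_text. Taking the minimum over a dictionary of distances replaces the chain of If statements and extends to more languages without extra branches.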
For this assignment, we must also add letters or letter combinations that are unique to each language to our code. However, if we add these to our ModelN/E/D strings, they will always end up last, because the frequency of such special letters and letter combinations is always lower than the frequency of the common single letters. This means they will have no impact on the result. Our final grade depends on this, so any help would be appreciated.

Attachments:
POSTED BY: Dione Cloots