Reproduce standard letter frequencies in the English language?

The Wikipedia article Letter frequency states that, according to a widely recognized analysis of the Concise Oxford Dictionary, the letters of the English alphabet sorted by frequency are:

etaoinshrdlcumwfgypbvkjxqz

(Image: letter-frequency chart from the Wikipedia article)

Other sources ( 1, 2 ) give similar results. I noticed @Marco Thiel used WordList in this post and obtained a different result. My own effort with the larger dictionary of DictionaryLookup also yields a different result. The question is: how can we reproduce the standard letter frequencies of the English language, and are these results truly standard? I will show my analysis below. First of all, a few assumptions:

  • While dictionaries vary in size and exact content, we assume that, due to their large sizes, some approximately universal statistics should emerge for all of them. For example, is it justified to assume that the most frequent letter is "e" for any large published English dictionary?

  • Count only non-repeating words. Different "InflectedForms" of the same root are fine to count. This is what distinguishes the letter frequencies of the English language from those of an English text corpus: in a text corpus, "the" being the most frequent word inflates the frequency of the letter "t", for instance. Here, and I assume in the Wikipedia sources mentioned, the letter frequencies of the English language are in question, so we look at non-repeating words to keep the LETTER frequencies independent of the WORD frequencies (see the toy illustration after this list).
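A toy illustration of this distinction, using a hypothetical four-word "corpus":

sample = {"the", "the", "the", "cat"};
Tally[Characters[StringJoin[sample]]]                   (* corpus-style: "t" counted 4 times *)
Tally[Characters[StringJoin[DeleteDuplicates[sample]]]] (* dictionary-style: "t" counted 2 times *)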

Another question is: are these assumptions aligned with the calculations mentioned in Wikipedia? I will use DictionaryLookup and first get all the words in the English dictionary:

rawENG = DictionaryLookup[];
rawENG // Length
Out[]= 92518
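As a side note, the curated WordList mentioned above is a smaller list than DictionaryLookup, which alone could shift the tallies; a quick size check (a side comparison, not part of the main analysis):

WordList[] // Length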

Let's split complex words, delete duplicates, delete single letters, and convert everything to lower case:

splitENG=Select[Union[Flatten[StringCases[ToLowerCase[rawENG],LetterCharacter..]]],StringLength[#]>1&];
splitENG//Length
Out[]= 90813

It still contains non-standard English characters:

nonREG = Complement[Union[Flatten[ToLowerCase[Characters[splitENG]]]],Alphabet[]]    
Out[]= {"á", "à", "â", "å", "ä", "ç", "é", "è", "ê", "í", "ï", "ñ", "ó", "ô", "ö", "û", "ü"}
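One alternative to dropping these words is to map the accented characters to their base letters with RemoveDiacritics and keep the words; a minimal sketch (only about 200 words are affected, so the tallies change marginally either way):

dicENGalt = Union[RemoveDiacritics /@ splitENG];
dicENGalt // Length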

Here, I simply delete the words that contain non-standard English characters:

dicENG=DeleteCases[splitENG,x_/;ContainsAny[Characters[x],nonREG]];
dicENG//Length
RandomSample[dicENG,10]

Out[]= 90613
Out[]= {"industrialism", "mathias", "tokenism", "showing", "schmo", "delighting", "seahorse", "longings", "shushing", "interdenominational"}

We still get quite a large dictionary with more than 90,000 words. Here are the sorted frequencies of the letters in this English dictionary:

singENGfreq = SortBy[Tally[Flatten[Characters[dicENG]]], Last]

{{"q",1419},{"j",1586},{"x",2069},{"z",3380},{"w",6899},{"k",7411},{"v",7774},{"f",10267},
{"y",12242},{"b",15053},{"h",17717},{"m",20785},{"p",21409},{"g",22682},{"u",25434},
{"d",28939},{"c",30597},{"l",40397},{"o",46441},{"t",50834},{"n",54596},{"r",55354},
{"a",59513},{"i",66084},{"s",66365},{"e",87107}}

We see that the sorted sequence is different from Wikipedia's. While "e" is by far the most frequent letter, "t" has badly lost its 2nd place. So what is the reason, and how can we reproduce the standard result?

BarChart[singENGfreq[[All, 2]], BarOrigin -> Left, BaseStyle -> 15,
    ChartLabels -> singENGfreq[[All, 1]], AspectRatio -> 1, PlotTheme -> "Detailed"]

(Image: bar chart of the sorted letter frequencies in the DictionaryLookup word list)

9 Replies

I think the letter frequencies in Wikipedia's article are derived from natural-language text corpora.

Using a modified version of Vitaliy's code over "Hamlet", we can get a letter-frequency distribution very similar to the one shown in the discussion's opening. (Although some of the letter ranks are transposed; "Hamlet" is a short and old text.)

text = ExampleData[{"Text", "Hamlet"}];

splitENG = 
  Select[Flatten[StringCases[ToLowerCase[text], LetterCharacter ..]], 
   StringLength[#] > 1 &];
splitENG // Length

(* 30773 *)

textENG = 
  DeleteCases[splitENG, x_ /; ContainsAny[Characters[x], nonREG]];
textENG // Length
RandomSample[dicENG, 10] (* note: this samples the dictionary list from the original post, not the Hamlet words *)

(* 30773 *)

 (* {"festal", "impassibility", "ungainliest", "transition", "troubled", \
     "egoist", "wonderlands", "minesweeper", "afforests", "aniseed"} *)

singENGfreq = SortBy[Tally[Flatten[Characters[textENG]]], Last]

(* {{"z", 72}, {"j", 111}, {"x", 175}, {"q", 218}, {"v", 1222}, {"k", 
  1266}, {"b", 1812}, {"p", 2002}, {"g", 2418}, {"c", 2606}, {"f", 
  2681}, {"w", 3128}, {"y", 3195}, {"m", 4248}, {"u", 4322}, {"d", 
  4755}, {"l", 5826}, {"r", 7736}, {"i", 7848}, {"s", 8082}, {"n", 
  8301}, {"h", 8678}, {"a", 9358}, {"o", 11031}, {"t", 11678}, {"e", 
  14965}} *)

BarChart[Reverse@singENGfreq[[All, 2]], BarOrigin -> Bottom, 
 BaseStyle -> 15, ChartLabels -> Reverse[singENGfreq[[All, 1]]], 
 AspectRatio -> 1, PlotTheme -> "Detailed"]

(Image: bar chart of the letter frequencies in "Hamlet")
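As a side note, the same corpus-style tally can be written more compactly with CharacterCounts (assuming that function is available in your Wolfram Language version); unlike the code above, this sketch also counts letters from single-letter words such as "a", so the numbers differ slightly:

text = ExampleData[{"Text", "Hamlet"}];
ReverseSort[KeySelect[CharacterCounts[ToLowerCase[text]], MemberQ[Alphabet[], #] &]]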

Thank you very much, Anton! The thought that they use a regular text corpus crossed my mind, but the Wikipedia article says:

Analysis of entries in the Concise Oxford dictionary is published by the compilers.

This is a bit confusing. I have not been able to find an exact definition of the procedure and data used for this standard letter-frequency sequence. If anyone knows, please comment.

Hi @Vitaliy Kaurov ,

my main issue is that there should actually be some information relating to this in the Wolfram Language. My first idea was to use WordFrequencyData as weights for the dictionary words, being aware that this would still change the letter frequencies because of grammatical issues, such as more "s" from plural forms and more "-ing", for example. The code would be very simple:

words = DictionaryLookup[];
AbsoluteTiming[
 wordfreq = Select[
    Transpose[{words, Normal[WordFrequencyData[words]][[All, 2]]}], 
    NumberQ[#[[2]]] &];]
letterfreq = {#[[1, 1]], Total[#[[All, 2]]]} & /@ 
  GatherBy[
   Flatten[Thread@{ToLowerCase[Characters[#[[1]]]], #[[2]]} & /@ 
     Select[wordfreq, NumberQ[#[[-1]]] &], 1], First]

The problem is that this doesn't work for me. It appears that it doesn't like so many word frequencies being queried at once. If I only run this for, say, the first 100 words, it appears to work just fine:

wordfreq = {#, WordFrequencyData[#]} & /@ words[[1 ;; 100]];
letterfreq = {#[[1, 1]], Total[#[[All, 2]]]} & /@ 
   GatherBy[
    Flatten[Thread@{ToLowerCase[Characters[#[[1]]]], #[[2]]} & /@ 
      Select[wordfreq, NumberQ[#[[-1]]] &], 1], First];

It also works with the slightly modified:

AbsoluteTiming[
wordfreq = Select[Transpose[{words[[1 ;; 100]], Normal[WordFrequencyData[words[[1 ;; 100]]]][[All, 2]]}], NumberQ[#[[2]]] &];]

1000 also seems to work; 10000 does not. I am not quite sure whether it simply times out. Perhaps you could go ahead and run this with some magic internal account and check what you get?
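Since batches of about 1000 words seem to work, one possible workaround (untested at full dictionary scale) would be to query WordFrequencyData in chunks and merge the resulting associations; the letterfreq computation above should then apply unchanged:

batches = Partition[words, UpTo[1000]];                      (* chunks of at most 1000 words *)
wordfreqAssoc = Join @@ (WordFrequencyData /@ batches);      (* merge the per-batch associations *)
wordfreq = List @@@ Normal[Select[wordfreqAssoc, NumberQ]];  (* {word, frequency} pairs, Missing dropped *)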

Cheers,

Marco

@Marco, true, some data calls can be slow. But in general, for those cases that work, the fresh first call should not be a Map (= many calls, slower), but should pass the whole list as a single argument (= single call, faster). On repeated evaluation with cached data, the timing logic can be different. Compare below. I doubt, though, that this info will help to run it on the whole dictionary, or that I will be able to do that. To tell you more I'd have to dig around a bit.

Fresh first call

AbsoluteTiming[WordFrequencyData[words[[101 ;; 200]]];]
{20.195363`, Null}
AbsoluteTiming[WordFrequencyData[#] & /@ words[[201 ;; 300]];]
{62.140271`, Null}

Repeated cached call

AbsoluteTiming[WordFrequencyData[words[[101 ;; 200]]];]
{6.006477`, Null}
AbsoluteTiming[WordFrequencyData[#] & /@ words[[201 ;; 300]];]
{1.912346`, Null}

There clearly are differences resulting from the precise corpus I choose. This website lists a great data resource for some counting exercises and appears to give results very similar to those in Vitaliy's original post. I downloaded the eng_news_2015_3M-words data file and get this:

wordlist = 
  Import["/Users/thiel/Desktop/eng_news_2015_3M/eng_news_2015_3M-words.txt", "TSV"];
allchars = {#[[1, 1]], Total[#[[All, 2]]]} & /@ 
   GatherBy[Flatten[Thread @{ToLowerCase[Characters[ToString[#[[1]]]]], #[[2]]} & /@ Select[wordlist, Length[#] == 4 &][[All, {3, 4}]], 1], First];

and then

standardallchars = 
 Reverse@SortBy[Select[allchars, MemberQ[CharacterRange["a", "z"], #[[1]]] &], Last];
BarChart[standardallchars[[All, 2]], BarOrigin -> Bottom, 
 BaseStyle -> 15, ChartLabels -> standardallchars[[All, 1]], 
 AspectRatio -> 1, PlotTheme -> "Detailed"]

(Image: bar chart of the letter frequencies in the eng_news_2015_3M corpus)
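Since published tables are usually given as percentages rather than raw counts, the tally can be normalized for comparison, e.g.:

total = Total[standardallchars[[All, 2]]];
percentages = {#[[1]], N[100 #[[2]]/total]} & /@ standardallchars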

They offer data for different English-speaking countries. It would be interesting to see their differences.

Cheers,

Marco

This is wonderful, @Marco Thiel, thank you. You and @Anton Antonov have completely convinced me that the standard frequencies result from a tally of a text corpus, not of dictionary items. I am certain the British National Corpus will give the same result.
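As a quick sanity check on yet another corpus, the same tally on a different built-in text (assuming "AliceInWonderland" is available in ExampleData) should show a broadly similar ranking:

alice = ExampleData[{"Text", "AliceInWonderland"}];
SortBy[Tally[Select[Characters[ToLowerCase[alice]], MemberQ[Alphabet[], #] &]], Last]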
