WOLFRAM COMMUNITY

GROUPS:

Wolfram Language

Count frequency of the second letter based on the first letter

ILYA ZAREZENKO

Posted 6 years ago

Hello everyone,

I am trying to get the frequency of the second letter based on the first letter. I explored the reference and found this solution for a single letter:

alist = DictionaryLookup["a" ~~ ___];
secondchars = StringTake[alist,{2}];
Counts[secondchars]

It returns

<|StringTake[a,{2}]->1,a->3,b->335,c->407,d->373,e->47,f->129,g->134,h->9,i->129,j->1,k->3,l->363,m->253,n->634,o->4,p->318,q->27,r->381,s->353,t->183,u->227,v->90,w->54,x->29,y->6,z->7|>

It seems OK. However, I don't understand the data type that is returned by the Counts function.

I want to iterate through Alphabet[] and apply this piece of code to a, b, c, d ... z. In terms of the C language, it is a simple for (i=0; i<N; i++) loop over an alphabet array that calls the function above. I read the reference and found pure function calls, so I tried to write this expression:

list = StringTake [#,2] &/@ DictionaryLookup[#~~ ___] &/@ Alphabet[]
Counts[list]

It doesn't work at all. What did I do wrong? I am new to the Wolfram Language, and its data types are completely unclear to me. I will try to solve this problem by myself, but maybe someone can help me. Thanks in advance.

It would also be perfect if I could create a table where the value in each cell is the frequency. For example:

     a    b    c    d
a    0   34   12    7
b   12    0    0    0
c   24    0    0    7
d   14    4    0    0

POSTED BY: ILYA ZAREZENKO

9 Replies

Anton Antonov, Accendo Data LLC

Posted 6 years ago

In order to get the contingency matrices you can use the Wolfram Function Repository function `CrossTabulate`.

ResourceFunction["CrossTabulate"]@
  Flatten[Map[Partition[Characters[#], 2, 1] &, ToLowerCase[DictionaryLookup["*"]]], 1]

POSTED BY: Anton Antonov

ILYA ZAREZENKO

Posted 6 years ago

Hello Anton,

Thanks for your answer. As I see, in your table (here is a small part) there is the value 569 at the intersection of the b row and the b column. I don't know that many words starting with bb. What does it mean? Could you please explain?

POSTED BY: ILYA ZAREZENKO

Anton Antonov, Accendo Data LLC

Posted 6 years ago

"I don't know so many words starting with bb."

The code I posted makes overlapping pairs of all characters for a given word. (See: `Partition[Characters[#], 2, 1]`.)
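To make the difference concrete, here is a minimal sketch on a few hand-picked words (the `words` list is a hypothetical stand-in for the dictionary). It contrasts all overlapping pairs, which count "bb" inside a word, with only the first two letters, which is what the original question asks for.

```wolfram
word = "rabbit";

(* All overlapping letter pairs of one word *)
Partition[Characters[word], 2, 1]
(* {{"r", "a"}, {"a", "b"}, {"b", "b"}, {"b", "i"}, {"i", "t"}} *)

(* Only the first two letters of the same word *)
Take[Characters[word], 2]
(* {"r", "a"} *)

(* Tallying first-letter pairs over a hypothetical word list *)
words = {"rabbit", "rabble", "bar"};
Tally[Take[Characters[#], 2] & /@ words]
(* {{{"r", "a"}, 2}, {{"b", "a"}, 1}} *)
```

With overlapping pairs, "rabbit" contributes one count to the b/b cell even though it does not start with "bb".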

POSTED BY: Anton Antonov

ILYA ZAREZENKO

Posted 6 years ago

Hi Rohit,

Thanks for your code and for the comments explaining what every piece of code does. I checked the resulting matrix for the Russian alphabet. In general, it looks pretty close. Indeed, in Russian there are no words starting with "??", "??", "??", "??", "??" and some others. However, after a quick look, I found some issues. For example:

??: ?????? (sorrel)
??: ????? (shogun)
??: ????? (scene), ???????? (scenario or script)

I marked them with red rectangles in the picture below. I guess there may be some more missing pairs; I didn't check all of them.

Having said that, I don't think there is a problem with the code or the algorithm. I suppose that the Wolfram dictionary data may not include every word found in a Russian dictionary. All in all, I appreciate your help and interest in my topic, as well as the interesting discussion.

POSTED BY: ILYA ZAREZENKO

ILYA ZAREZENKO

Posted 6 years ago

Hi Rohit,

It looks fantastic. Yes, please share your code. I am interested because I already made almost the same table in Excel.

Thanks, Ilya

POSTED BY: ILYA ZAREZENKO

Rohit Namjoshi

Posted 6 years ago

Hi Ilya,

Here is the code that I used. I generalized it to work with any language that WL has alphabet and dictionary data for. I verified it on English. Can you please verify that it works correctly for Russian. Thanks!

Two letter frequencies

Generate an association from the first two letters of each dictionary word to the frequency of occurrence. The first letter is taken from the alphabet, and every following letter must lie in the character range from the first to the last letter of the alphabet. This eliminates words containing capitals or accented characters in the first two letters (and anywhere later in the word). For some reason, DictionaryLookup[] for English returns words that contain characters that are not part of Alphabet[] for English.

language = "Russian";
alphabet = Alphabet[language];
numLetters = alphabet // Length;

pairCounts = 
  alphabet // 
     Map[DictionaryLookup[{language, # ~~ 
          CharacterRange[First@alphabet, Last@alphabet] ..}] &] // 
    Map[StringTake[#, 2] &] // Map[Counts];
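As a rough illustration of how the string pattern above filters words, the sketch below applies the same kind of pattern to a small hypothetical word list with `StringMatchQ` instead of `DictionaryLookup`. The names `lower` and `testWords` are made up for this example.

```wolfram
lower = CharacterRange["a", "z"];

(* Hypothetical test words: plain, capital inside, single letter, accented *)
testWords = {"abc", "aBc", "a", "abé"};

(* A list inside a string pattern acts as a set of alternatives;
   .. means "one or more", so at least two characters are required *)
StringMatchQ[testWords, "a" ~~ lower ..]
(* {True, False, False, False} *)
```

Only "abc" passes: "aBc" and "abé" contain characters outside the range, and "a" has no second letter.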

Matrix Plot

Several combinations do not occur so we need to add them to the association with a count of zero.

pairZeroCounts = 
  alphabet // Tuples[#, 2] & // Map[StringJoin] // 
   AssociationThread[#, ConstantArray[0, numLetters^2]] &;
allPairCounts = <|pairZeroCounts, pairCounts|>;
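The zero-filling step relies on how associations merge: when the same key appears more than once, the later value wins, while the key keeps the position of its first occurrence. A minimal sketch with a two-letter alphabet and made-up counts (`letters`, `zeros`, and `observed` are hypothetical names):

```wolfram
letters = {"a", "b"};

(* Every ordered pair of letters, each with a count of zero *)
zeros = AssociationMap[0 &, StringJoin /@ Tuples[letters, 2]]
(* <|"aa" -> 0, "ab" -> 0, "ba" -> 0, "bb" -> 0|> *)

(* Made-up observed counts, one association per first letter *)
observed = {<|"ab" -> 3|>, <|"ba" -> 5|>};

(* Later values override the zeros; key order follows the zeros *)
merged = <|zeros, observed|>
(* <|"aa" -> 0, "ab" -> 3, "ba" -> 5, "bb" -> 0|> *)
```

Because the key order comes from `Tuples`, the values can later be partitioned row by row into a matrix.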

Generate matrix of frequencies and text strings of frequency values centered over matrix rows and columns.

matrixValues = allPairCounts // Values // Partition[#, numLetters] &;
epilog = MapIndexed[Text[Style[#, 10], #2 - 1/2] &, Transpose@Reverse@matrixValues, {2}];
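For a plain table rather than a plot, the same partitioned values can be shown with `TableForm` or `Grid`. The following is a small sketch with a three-letter alphabet and made-up counts; `letters`, `counts`, and `matrix` are hypothetical names, and the keys are assumed to be in the order produced by `Tuples[letters, 2]`.

```wolfram
letters = {"a", "b", "c"};

(* Made-up counts, keys ordered as aa, ab, ac, ba, bb, bc, ca, cb, cc *)
counts = <|"aa" -> 0, "ab" -> 34, "ac" -> 12,
   "ba" -> 12, "bb" -> 0, "bc" -> 0,
   "ca" -> 24, "cb" -> 0, "cc" -> 0|>;

(* One row per first letter, one column per second letter *)
matrix = Partition[Values[counts], Length[letters]]
(* {{0, 34, 12}, {12, 0, 0}, {24, 0, 0}} *)

(* Table with row and column headings *)
TableForm[matrix, TableHeadings -> {letters, letters}]

(* The same table as a Grid, with the letters prepended as header row and column *)
Grid[Prepend[MapThread[Prepend, {matrix, letters}], Prepend[letters, ""]], Frame -> All]
```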

Labels, ticks and MatrixPlot.

frameLabels = Style[#, 16, Black] & /@ {"Second Letter", "First Letter"};
ticks = Transpose[{Range@numLetters, alphabet // Map[Style[#, 14, Black] &]}];

matrixValues //
 MatrixPlot[
   #,
   Mesh -> All,
   FrameTicks -> {ticks, ticks, ticks, ticks},
   FrameLabel -> Transpose[{frameLabels, frameLabels}],
   PlotLegends -> 
    Placed[Style[language <> " Words", 20, Black, Bold], Above],
   ColorFunction -> "TemperatureMap",
   ColorRules -> {0 -> White},
   ImageSize -> 800,
   Epilog -> epilog] &

Graph

edges = pairCounts // Keys // Characters // Apply[DirectedEdge, #, {2}] &;

(* Association of second letter to frequency *)
weights = pairCounts // Map[KeyMap[StringTake[#, -1] &]];
(* Weight of 1 for second letters that do not occur *) 
defaultWeights = Thread[alphabet -> ConstantArray[1, numLetters]] // Map[Association];

vertexWeights = MapThread[Association, {defaultWeights, weights}];
weightRange = vertexWeights // MinMax;

(* Helper to set VertexSize and VertexStyle *)
setProperties[graph_, index_] := 
 Module[{scaledWeights = Rescale[vertexWeights[[First@index]], weightRange]},
  SetProperty[graph, 
   {VertexSize -> {v_ :> scaledWeights[v]}, 
    VertexStyle -> {v_ :> (ColorData[{"SolarColors", "Reversed"}]@scaledWeights[v])}}]]

G = edges // Map[Graph[#,
      VertexLabels -> Placed["Name", Below],
      VertexLabelStyle -> Directive[Black, 16],
      GraphLayout -> "RadialEmbedding"] &];

G // MapIndexed[setProperties] // Partition[#, UpTo[6]] & // Grid[#, Frame -> All] &
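The `edges` definition may be the least obvious step in the graph code, so here is a small sketch of what it does on made-up keys (`keys` is a hypothetical stand-in for `pairCounts // Keys`). `Characters` is listable, so it splits every two-letter key into a pair, and `Apply` at level `{2}` replaces the `List` head of each pair with `DirectedEdge`.

```wolfram
(* Made-up keys: one list of two-letter strings per first letter *)
keys = {{"ab", "ac"}, {"ba"}};

Characters[keys]
(* {{{"a", "b"}, {"a", "c"}}, {{"b", "a"}}} *)

Apply[DirectedEdge, Characters[keys], {2}]
(* {{DirectedEdge["a", "b"], DirectedEdge["a", "c"]}, {DirectedEdge["b", "a"]}} *)
```

Each inner list of edges then becomes one `Graph`, one per first letter.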

POSTED BY: Rohit Namjoshi

Rohit Namjoshi

Posted 6 years ago

Hi Ilya,

I tried a couple of ways of visualizing this data.

`MatrixPlot`. Vertical bands where the second letter is a vowel are clearly visible. `a` is the only first letter that has every letter as the second letter; `e` and `o` are close, with only one missing second letter. `re` and `co` are the most frequent pairs. `z` and `j` are the least frequent second letters.

`Grid` of `Graph` with vertex size and color based on frequency.

I can post the code if you are interested.

POSTED BY: Rohit Namjoshi

ILYA ZAREZENKO

Posted 6 years ago

Thank you very much, Rohit, for your answer. I tried it and it works. I also appreciate your explanations, and indeed, postfix form looks better.

Best, Ilya

POSTED BY: ILYA ZAREZENKO

Rohit Namjoshi

Posted 6 years ago

Hi Ilya,

"However I don't understand the data type which is returned from Counts function"

`Counts` returns an `Association`. In other languages it is called a Hash, HashMap, Map, or Dictionary. It is a set of `key -> value` pairs.

This

list = StringTake [#,2] &/@ DictionaryLookup[#~~ ___] &/@ Alphabet[];

does not work as expected because of the precedence / associativity of operators. You need to parenthesize.

list = StringTake[#, 2] & /@ (DictionaryLookup[# ~~ ___] & /@ Alphabet[]);

The other problem is that `# ~~ ___` also returns single-character words, for which `StringTake` will fail, so only consider words with two or more characters.

list = StringTake[#, 2] & /@ (DictionaryLookup[# ~~ __] & /@ Alphabet[]);

Finally, `list` is a list of lists, so you need to `Map` the `Counts` function.

Map[Counts, list]
(* {<|"aa" -> 3, "ab" -> 335, "ac" -> 407, "ad" -> 373, "ae" -> 47, .... *)

I prefer to write these kinds of expressions in postfix form.

Alphabet[] // Map[DictionaryLookup[# ~~ __] &] // Map[StringTake[#, 2] &] // Map[Counts]

To generate the table, take a look at the `Grid` function. You will have to deal with combinations for which there are no counts. If you get stuck, post another question.

Since you are new to WL, a good learning resource is Stephen Wolfram's 'An Elementary Introduction to the Wolfram Language', which is available online.
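For readers coming from other languages, a short sketch of working with the `Association` that `Counts` returns may help. The input list is made up for illustration; `Lookup` with a default value is one way to handle pairs that have no count.

```wolfram
(* Counts on a made-up list of two-letter prefixes *)
counts = Counts[{"ab", "ac", "ab", "ad"}]
(* <|"ab" -> 2, "ac" -> 1, "ad" -> 1|> *)

counts["ab"]              (* 2 *)
Keys[counts]              (* {"ab", "ac", "ad"} *)
Values[counts]            (* {2, 1, 1} *)

(* A missing key gives Missing[...] unless a default is supplied *)
counts["zz"]              (* Missing["KeyAbsent", "zz"] *)
Lookup[counts, "zz", 0]   (* 0 *)
```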

POSTED BY: Rohit Namjoshi
