Count frequency of the second letter based on the first letter

Posted 7 years ago

Hello everyone,

I am trying to get the frequency of the second letter based on the first letter. I explored the reference documentation and found this solution for a single letter.

alist = DictionaryLookup["a" ~~ ___];
secondchars = StringTake[alist,{2}];
Counts[secondchars]

it returns

\[LeftAssociation]StringTake[a,{2}]->1,a->3,b->335,c->407,d->373,e->47,f->129,g->134,h->9,i->129,j->1,k->3,l->363,m->253,n->634,o->4,p->318,q->27,r->381,s->353,t->183,u->227,v->90,w->54,x->29,y->6,z->7\[RightAssociation]

It seems OK. However, I don't understand the data type returned by the Counts function.

So I want to iterate through Alphabet[] and apply this piece of code to a, b, c, d ... z. In terms of C, it is a simple for (i=0; i<N; i++) loop over an alphabet array that calls the function above.

I read the reference and found pure functions. So I tried to write this expression:

list = StringTake [#,2] &/@ DictionaryLookup[#~~ ___] &/@  Alphabet[]
Counts[list]

It doesn't work at all.

What did I do wrong?

I am new to the Wolfram Language, and its data types are completely unclear to me. I will try to solve this problem myself, but maybe someone can help me. Thanks in advance.

And it would be perfect if I could create a table where the value in each cell is the frequency.

For example

     a    b    c    d
a    0   34   12    7
b   12    0    0    0
c   24    0    0    7
d   14    4    0    0
POSTED BY: ILYA ZAREZENKO
9 Replies

In order to get the contingency matrices you can use the Wolfram Function Repository function CrossTabulate.

ResourceFunction["CrossTabulate"]@
 Flatten[Map[Partition[Characters[#], 2, 1] &, 
   ToLowerCase[DictionaryLookup["*"]]], 1]

enter image description here

POSTED BY: Anton Antonov
Posted 7 years ago

Hello Anton,

Thanks for your answer.

As I see, in your table (here is a small part)

enter image description here

there is the value 569 at the intersection of row b and column b.

I don't know so many words starting with bb.

What does it mean?

Could you please explain.

POSTED BY: ILYA ZAREZENKO

I don't know so many words starting with bb.

The code I posted makes overlapping pairs of all characters in each word, so the table counts every adjacent pair of letters, not only the first two letters. (See: Partition[Characters[#], 2, 1].)
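A minimal sketch of the difference, using the word "rabbit" as a stand-in example (the word is my choice for illustration, not taken from the posts above). Partition with an offset of 1 yields every adjacent pair, while taking the first two characters yields only the leading pair:

```wolfram
word = "rabbit";

(* Overlapping pairs: every adjacent pair of characters in the word *)
allPairs = Partition[Characters[word], 2, 1]
(* {{"r", "a"}, {"a", "b"}, {"b", "b"}, {"b", "i"}, {"i", "t"}} *)

(* Only the leading pair, which is what the original question asks about *)
firstPair = Take[Characters[word], 2]
(* {"r", "a"} *)
```

This is why the b/b cell is large: words like "rabbit" contribute a "bb" pair even though they do not start with "bb".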

POSTED BY: Anton Antonov
Posted 7 years ago

Hi Rohit,

It looks fantastic. Yes, please share your code. I am interested because I already made almost the same table in Excel.

Thanks,

Ilya

POSTED BY: ILYA ZAREZENKO
Posted 7 years ago

Hi Ilya,

Here is the code that I used. I generalized it to work with any language that WL has alphabet and dictionary data for. I verified it on English. Can you please verify that it works correctly for Russian? Thanks!

Two letter frequencies

Generate an association from the first two letters of each dictionary word to the frequency of occurrence. Letters are restricted to the range from the first to the last letter of the alphabet, which eliminates words containing capitals or accented characters. For some reason, DictionaryLookup[] for English returns words containing characters that are not part of Alphabet[] for English.

language = "Russian";
alphabet = Alphabet[language];
numLetters = alphabet // Length;

pairCounts = 
  alphabet // 
     Map[DictionaryLookup[{language, # ~~ 
          CharacterRange[First@alphabet, Last@alphabet] ..}] &] // 
    Map[StringTake[#, 2] &] // Map[Counts];
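To see the shape of the result without dictionary data, here is a small sketch on a hand-made word list (the names words, letters, and demoCounts are hypothetical and only for illustration). Counts returns an Association, which is a key-to-value lookup structure, so each element of the result maps a two-letter prefix to its count:

```wolfram
(* Hypothetical word list standing in for the DictionaryLookup results *)
words = {"able", "about", "acid", "bad", "bed", "bear"};
letters = {"a", "b"};

(* One Association per first letter: two-letter prefix -> count *)
demoCounts = Map[
   Function[letter,
    Counts[StringTake[Select[words, StringStartsQ[#, letter] &], 2]]],
   letters]
(* {<|"ab" -> 2, "ac" -> 1|>, <|"ba" -> 1, "be" -> 2|>} *)

(* Lookup reads a value by key and supplies a default for missing keys *)
Lookup[demoCounts[[1]], "ab", 0]  (* 2 *)
Lookup[demoCounts[[1]], "az", 0]  (* 0 *)
```

Using Function with a named argument for the outer loop keeps the two levels of # apart, which is one way to avoid confusion when nesting pure functions.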

Matrix Plot

Several combinations do not occur so we need to add them to the association with a count of zero.

pairZeroCounts = 
  alphabet // Tuples[#, 2] & // Map[StringJoin] // 
   AssociationThread[#, ConstantArray[0, numLetters^2]] &;
allPairCounts = <|pairZeroCounts, pairCounts|>;
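A small sketch of the same zero-filling idea on a two-letter alphabet (the names letters, zeros, observed, and merged, and the counts themselves, are made up for illustration). When associations are combined, a value given later replaces an earlier value for the same key, so observed counts overwrite the zeros:

```wolfram
letters = {"a", "b"};

(* Every possible pair, each with a count of zero *)
pairs = StringJoin /@ Tuples[letters, 2];  (* {"aa", "ab", "ba", "bb"} *)
zeros = AssociationThread[pairs, ConstantArray[0, Length[pairs]]];

(* Hypothetical observed counts, one Association per first letter *)
observed = {<|"ab" -> 3|>, <|"ba" -> 5|>};

(* Later entries win, so observed pairs replace their zero entries *)
merged = <|zeros, observed|>;

merged["ab"]  (* 3 *)
merged["bb"]  (* 0 *)
Length[merged]  (* 4 *)
```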

Generate matrix of frequencies and text strings of frequency values centered over matrix rows and columns.

matrixValues = allPairCounts // Values // Partition[#, numLetters] &;
epilog = MapIndexed[Text[Style[#, 10], #2 - 1/2] &, Transpose@Reverse@matrixValues, {2}];
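If only a plain table is needed, like the one sketched in the original question, an alternative that skips the zero-filling step is to look up each pair with a default of 0. This is a sketch under assumptions: the three-letter alphabet and the counts below are taken from the example table in the question, not computed from a dictionary, and the names letters, counts, and matrix are mine.

```wolfram
letters = {"a", "b", "c"};

(* Example counts: two-letter prefix -> frequency *)
counts = <|"ab" -> 34, "ac" -> 12, "ba" -> 12, "ca" -> 24|>;

(* Rows are first letters, columns are second letters; missing pairs become 0 *)
matrix = Table[Lookup[counts, r <> c, 0], {r, letters}, {c, letters}]
(* {{0, 34, 12}, {12, 0, 0}, {24, 0, 0}} *)

(* Display with row and column headings *)
TableForm[matrix, TableHeadings -> {letters, letters}]
```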

Labels, ticks and MatrixPlot.

frameLabels = Style[#, 16, Black] & /@ {"Second Letter", "First Letter"};
ticks = Transpose[{Range@numLetters, alphabet // Map[Style[#, 14, Black] &]}];

matrixValues //
 MatrixPlot[
   #,
   Mesh -> All,
   FrameTicks -> {ticks, ticks, ticks, ticks},
   FrameLabel -> Transpose[{frameLabels, frameLabels}],
   PlotLegends -> 
    Placed[Style[language <> " Words", 20, Black, Bold], Above],
   ColorFunction -> "TemperatureMap",
   ColorRules -> {0 -> White},
   ImageSize -> 800,
   Epilog -> epilog] &

enter image description here

Graph

edges = pairCounts // Keys // Characters // Apply[DirectedEdge, #, {2}] &;

(* Association of second letter to frequency *)
weights = pairCounts // Map[KeyMap[StringTake[#, -1] &]];
(* Weight of 1 for second letters that do not occur *) 
defaultWeights = Thread[alphabet -> ConstantArray[1, numLetters]] // Map[Association];

vertexWeights = MapThread[Association, {defaultWeights, weights}];
weightRange = vertexWeights // MinMax;

(* Helper to set VertexSize and VertexStyle *)
setProperties[graph_, index_] := 
 Module[{scaledWeights = Rescale[vertexWeights[[First@index]], weightRange]},
  SetProperty[graph, 
   {VertexSize -> {v_ :> scaledWeights[v]}, 
    VertexStyle -> {v_ :> (ColorData[{"SolarColors", "Reversed"}]@scaledWeights[v])}}]]

G = edges // Map[Graph[#,
      VertexLabels -> Placed["Name", Below],
      VertexLabelStyle -> Directive[Black, 16],
      GraphLayout -> "RadialEmbedding"] &];

G // MapIndexed[setProperties] // Partition[#, UpTo[6]] & // Grid[#, Frame -> All] &

enter image description here

POSTED BY: Rohit Namjoshi
Posted 7 years ago

Hi Ilya,

I tried a couple of ways of visualizing this data.

MatrixPlot. Vertical bands where the second letter is a vowel are clearly visible. a is the only first letter that has every letter as the second letter; e and o are close, each with only one missing second letter. re and co are the most frequent pairs. z and j are the least frequent second letters.

enter image description here

Grid of Graph with vertex size and color based on frequency.

enter image description here

I can post the code if you are interested.

POSTED BY: Rohit Namjoshi
Posted 7 years ago

Thank you very much, Rohit, for your answer. I tried it and it works. I also appreciate your explanations, and indeed, the postfix form looks better.

Best, Ilya

POSTED BY: ILYA ZAREZENKO