Count frequency of the second letter based on the first letter

Posted 5 years ago

Hello everyone,

I am trying to get the frequency of the second letter based on the first letter. I explored the reference and found this solution for one letter.

alist = DictionaryLookup["a" ~~ ___];
secondchars = StringTake[alist,{2}];
Counts[secondchars]

it returns

\[LeftAssociation]StringTake[a,{2}]->1,a->3,b->335,c->407,d->373,e->47,f->129,g->134,h->9,i->129,j->1,k->3,l->363,m->253,n->634,o->4,p->318,q->27,r->381,s->353,t->183,u->227,v->90,w->54,x->29,y->6,z->7\[RightAssociation]

It seems OK. However I don't understand the data type which is returned from Counts function.

So I want to iterate through Alphabet[] and apply this piece of code to a, b, c, d, ..., z. In C terms, it is a simple for (i=0; i<N; i++) loop over an alphabet array that calls the function above.

I read the reference and found pure function calls. So I tried to write this expression:

list = StringTake [#,2] &/@ DictionaryLookup[#~~ ___] &/@  Alphabet[]
Counts[list]

It doesn't work at all.

What did I do wrong?

I am new to the Wolfram Language, and its data types are completely unclear to me. I will try to solve this problem myself, but maybe someone can help me. Thanks in advance.

And it would be perfect if I could create a table where the value in each cell is the frequency.

For example

     a    b    c    d
a    0   34   12    7
b   12    0    0    0
c   24    0    0    7
d   14    4    0    0
POSTED BY: ILYA ZAREZENKO
9 Replies
Posted 5 years ago

Hi Ilya,

> However I don't understand the data type which is returned from Counts function

Counts returns an Association. In other languages it is called a Hash, HashMap, Map, or Dictionary. It is a set of key -> value pairs.
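
As a minimal sketch of what that means in practice (the input list here is made up for illustration, not taken from the dictionary), an Association can be queried by key, and Lookup supplies a default for keys that are absent:

```wolfram
(* Toy input; the letters are illustrative only *)
counts = Counts[{"b", "n", "b", "c", "n", "n"}]
(* <|"b" -> 2, "n" -> 3, "c" -> 1|> *)

counts["n"]              (* 3 *)
Lookup[counts, "z", 0]   (* 0, because "z" is not a key *)
Keys[counts]             (* {"b", "n", "c"} *)
Values[counts]           (* {2, 3, 1} *)
```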

This

list = StringTake [#,2] &/@ DictionaryLookup[#~~ ___] &/@  Alphabet[];

does not work as expected because of precedence / associativity of operators. You need to parenthesize.

list = StringTake[#, 2] & /@ (DictionaryLookup[# ~~ ___] & /@ Alphabet[]);
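
One way to see how such an expression is grouped, sketched here with the undefined placeholder symbols f, g and list rather than the real functions, is to hold it unevaluated and inspect its FullForm:

```wolfram
(* f, g and list are placeholders; Hold prevents any evaluation *)
FullForm[Hold[f[#] & /@ g[#] & /@ list]]

(* The same expression with explicit parentheses, for comparison *)
FullForm[Hold[f[#] & /@ (g[#] & /@ list)]]
```

Comparing the two outputs shows which Function each # belongs to and where each Map is applied.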

The other problem is that # ~~ ___ will return single character words so StringTake will fail, so only consider words with two or more characters.

list = StringTake[#, 2] & /@ (DictionaryLookup[# ~~ __] & /@ Alphabet[]);

Finally, list is a list of lists so you need to Map the Counts function.

Map[Counts, list]
(* {<|"aa" -> 3, "ab" -> 335, "ac" -> 407, "ad" -> 373, "ae" -> 47, .... *)

I prefer to write these kinds of expressions in postfix form.

Alphabet[] // Map[DictionaryLookup[# ~~ __] &] // Map[StringTake[#, 2] &] // Map[Counts]

To generate the table, take a look at the Grid function. You will have to deal with combinations for which there are no counts. If you get stuck, post another question.
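
A rough sketch of one way to build such a table, assuming the English dictionary and lowercased words. The names words, pairCounts and table are introduced here only for illustration. Lookup with a default of 0 covers the combinations that never occur:

```wolfram
letters = Alphabet[];

(* All dictionary words, lowercased, keeping those with at least two characters *)
words = Select[ToLowerCase[DictionaryLookup["*"]], StringLength[#] >= 2 &];

(* Association from the first two characters to the number of words *)
pairCounts = Counts[StringTake[words, 2]];

(* Rows are first letters, columns are second letters; missing pairs become 0 *)
table = Table[
   Lookup[pairCounts, first <> second, 0],
   {first, letters}, {second, letters}];

(* Add a header row and a leading label column, then display *)
Grid[
 Prepend[MapThread[Prepend, {table, letters}], Prepend[letters, ""]],
 Frame -> All]
```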

Since you are new to WL, a good learning resource is Stephen Wolfram's 'An Elementary Introduction to the Wolfram Language' which is available online.

POSTED BY: Rohit Namjoshi
Posted 5 years ago

Thank you very much, Rohit, for your answer. I tried it and it works. I also appreciate your explanations, and indeed, postfix form looks better.

Best, Ilya

POSTED BY: ILYA ZAREZENKO
Posted 5 years ago

Hi Ilya,

I tried a couple of ways of visualizing this data.

MatrixPlot. Vertical bands where the second letter is a vowel are clearly visible. a is the only first letter that has every letter as the second letter; e and o are close, with only one missing second letter each. re and co are the most frequent pairs. z and j are the least frequent second letters.

[Image: MatrixPlot of first-letter vs. second-letter frequencies]

Grid of Graph with vertex size and color based on frequency.

[Image: grid of graphs with vertex size and color based on frequency]

I can post the code if you are interested.

POSTED BY: Rohit Namjoshi
Posted 5 years ago

Hi Rohit,

It looks fantastic. Yes, please share your code. I am interested because I already made almost the same table in Excel.

Thanks,

Ilya

POSTED BY: ILYA ZAREZENKO
Posted 5 years ago

Hi Ilya,

Here is the code that I used. I generalized it to work with any language for which WL has alphabet and dictionary data. I verified it on English. Can you please verify that it works correctly for Russian? Thanks!

Two letter frequencies

Generate an association from the first two letters of each dictionary word to their frequency of occurrence. Letters are restricted to the range between the first and last letters of the alphabet. This eliminates words containing capitals or accented characters in the first two letters. For some reason, DictionaryLookup[] for English includes words containing characters that are not part of Alphabet[] for English.

language = "Russian";
alphabet = Alphabet[language];
numLetters = alphabet // Length;

pairCounts = 
  alphabet // 
     Map[DictionaryLookup[{language, # ~~ 
          CharacterRange[First@alphabet, Last@alphabet] ..}] &] // 
    Map[StringTake[#, 2] &] // Map[Counts];

Matrix Plot

Several combinations do not occur so we need to add them to the association with a count of zero.

pairZeroCounts = 
  alphabet // Tuples[#, 2] & // Map[StringJoin] // 
   AssociationThread[#, ConstantArray[0, numLetters^2]] &;
allPairCounts = <|pairZeroCounts, pairCounts|>;
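
A small sketch of why this merge works, using a made-up two-letter alphabet and made-up counts: when a key appears more than once inside <|...|>, the later value replaces the earlier one, so observed counts overwrite the zeros while the key order of the zero-filled association is kept.

```wolfram
(* Toy alphabet and counts, for illustration only *)
toyAlphabet = {"a", "b"};
zeros = AssociationThread[StringJoin /@ Tuples[toyAlphabet, 2], ConstantArray[0, 4]]
(* <|"aa" -> 0, "ab" -> 0, "ba" -> 0, "bb" -> 0|> *)

observed = <|"ab" -> 5, "ba" -> 2|>;

<|zeros, observed|>
(* <|"aa" -> 0, "ab" -> 5, "ba" -> 2, "bb" -> 0|> *)
```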

Generate matrix of frequencies and text strings of frequency values centered over matrix rows and columns.

matrixValues = allPairCounts // Values // Partition[#, numLetters] &;
epilog = MapIndexed[Text[Style[#, 10], #2 - 1/2] &, Transpose@Reverse@matrixValues, {2}];

Labels, ticks and MatrixPlot.

frameLabels = Style[#, 16, Black] & /@ {"Second Letter", "First Letter"};
ticks = Transpose[{Range@numLetters, alphabet // Map[Style[#, 14, Black] &]}];

matrixValues //
 MatrixPlot[
   #,
   Mesh -> All,
   FrameTicks -> {ticks, ticks, ticks, ticks},
   FrameLabel -> Transpose[{frameLabels, frameLabels}],
   PlotLegends -> 
    Placed[Style[language <> " Words", 20, Black, Bold], Above],
   ColorFunction -> "TemperatureMap",
   ColorRules -> {0 -> White},
   ImageSize -> 800,
   Epilog -> epilog] &

[Image: MatrixPlot output]

Graph

edges = pairCounts // Keys // Characters // Apply[DirectedEdge, #, {2}] &;

(* Association of second letter to frequency *)
weights = pairCounts // Map[KeyMap[StringTake[#, -1] &]];
(* Weight of 1 for second letters that do not occur *) 
defaultWeights = Thread[alphabet -> ConstantArray[1, numLetters]] // Map[Association];

vertexWeights = MapThread[Association, {defaultWeights, weights}];
weightRange = vertexWeights // MinMax;

(* Helper to set VertexSize and VertexStyle *)
setProperties[graph_, index_] := 
 Module[{scaledWeights = Rescale[vertexWeights[[First@index]], weightRange]},
  SetProperty[graph, 
   {VertexSize -> {v_ :> scaledWeights[v]}, 
    VertexStyle -> {v_ :> (ColorData[{"SolarColors", "Reversed"}]@scaledWeights[v])}}]]

G = edges // Map[Graph[#,
      VertexLabels -> Placed["Name", Below],
      VertexLabelStyle -> Directive[Black, 16],
      GraphLayout -> "RadialEmbedding"] &];

G // MapIndexed[setProperties] // Partition[#, UpTo[6]] & // Grid[#, Frame -> All] &

[Image: grid of graphs output]

POSTED BY: Rohit Namjoshi
Posted 5 years ago

Hi Rohit,

Thanks for your code and for the comments explaining what each piece does.

I checked the resulting matrix for the Russian alphabet. In general, it looks pretty close. Indeed, in Russian there are no words starting with "??", "??", "??", "??", "??" and some others.

However, after a quick look, I found some issues.

For example:
??: ?????? (sorrel)
??: ????? (shogun)
??: ????? (scene), ???????? (scenario or script)

I marked them with red rectangles in the picture below.

[Image: matrix with the missing pairs marked by red rectangles]

I guess there may be more missing pairs; I didn't check all of them.

Having said that, I don't think there is a problem with the code or the algorithm. I suppose that the Wolfram dictionary may not contain all Russian words.

All in all, I appreciate your help and interest in my topic, as well as an interesting discussion.

POSTED BY: ILYA ZAREZENKO

In order to get the contingency matrices you can use the Wolfram Function Repository function CrossTabulate.

ResourceFunction["CrossTabulate"]@
 Flatten[Map[Partition[Characters[#], 2, 1] &, 
   ToLowerCase[DictionaryLookup["*"]]], 1]

[Image: CrossTabulate output]

POSTED BY: Anton Antonov
Posted 5 years ago

Hello Anton,

Thanks for your answer.

As I see, in your table (here is a small part)

[Image: part of the CrossTabulate table]

there is a value of 569 at the intersection of the b row and the b column.

I don't know so many words starting with bb.

What does it mean?

Could you please explain?

POSTED BY: ILYA ZAREZENKO

> I don't know so many words starting with bb.

The code I posted makes overlapping pairs of all characters for a given word. (See: Partition[Characters[#], 2, 1] .)
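
To illustrate with a single made-up word: an offset of 1 produces every adjacent pair, so a doubled letter inside a word is counted, not only the first two letters.

```wolfram
(* All overlapping pairs of characters in one word *)
Partition[Characters["rabbit"], 2, 1]
(* {{"r", "a"}, {"a", "b"}, {"b", "b"}, {"b", "i"}, {"i", "t"}} *)

(* Only the first pair, for comparison *)
Take[Characters["rabbit"], 2]
(* {"r", "a"} *)
```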

POSTED BY: Anton Antonov