

[✓] Count syllables in speeches?

Posted 13 days ago
5 Replies
2 Total Likes

For a text analysis project, I want to count the syllables in speeches.

I saw WordData's "Hyphenation" property, but the elementary task of applying it to my list of words is just out of my reach (I have only just started working with text analysis and the Wolfram Language).


5 Replies

I found that WordData[word,"Hyphenation"] usually divides words into syllables correctly, with three caveats. The word must have hyphenation info in WordData, some short 2-syllable words are returned without a hyphenation division, and some words are too short to hyphenate.

WordData[#, "Hyphenation"] & /@ {"wished", "over", "of"}

{Missing["NotAvailable"], {"over"}, Missing["NotAvailable"]}
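If you turn these results into counts, note that Length[Missing["NotAvailable"]] evaluates to 1, so a missing entry would quietly look like a one-syllable word. Here is a minimal sketch that passes missing values through instead. The name hyphenCount is my own hypothetical helper, and the sample values are copied from the output above so that no WordData lookup is needed.

```wolfram
(* Hypothetical helper: count the parts of a hyphenation result, *)
(* but pass Missing[...] through instead of counting it as 1.    *)
hyphenCount[h_List] := Length[h];
hyphenCount[m_Missing] := m;

(* Sample values copied from the output shown above. *)
results = {Missing["NotAvailable"], {"over"}, Missing["NotAvailable"]};

counts = hyphenCount /@ results
(* {Missing["NotAvailable"], 1, Missing["NotAvailable"]} *)
```

For a real word list you would presumably map hyphenCount[WordData[#, "Hyphenation"]] & over the words. The 1 reported for "over" is the second caveat in action, not a correct syllable count.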

It is possible to overcome these issues. If you absolutely need syllable information for every word, then I recommend that you begin with WordData[word,"Hyphenation"]. Then retrieve WordData[word,"PhoneticForm"]. This gives the International Phonetic Alphabet version of the word, if it is available. The number of syllables will correspond to the number of vowel sounds in the word. In the case of words like "over", this will be more accurate.

ipaVowels = {"aɪ", "aʊ", "eɪ", "ɔɪ", "oʊ", "ɐ", "ɑ", "ɒ", "ɔ", "ɘ", "ə", "ɛ", "ɜ", "ɝ", "ɞ", "ɤ", "ɨ", "ɪ", "ɯ", "ɵ", "ɶ", "ʉ", "ʊ", "ʌ", "ʏ", "a", "æ", "e", "i", "o", "œ", "ø", "u", "y"};
phon = WordData[#, "PhoneticForm"] & /@ {"wished", "over", "of"};
{#, StringCount[#, ipaVowels]} & /@ phon

{{"wˈɪʃt", 1}, {"ˈoʊvɝ", 2}, {"ˈʌv", 1}}

Words that are not in WordData, or that have neither hyphenation nor phonetic values, can be handled by crudely counting letter combinations that are likely to be vowel sounds. For example, the word "Lenore" likely contains two syllables because of the arrangement of its vowels, final -e usually being silent.
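A minimal sketch of that crude count, assuming nothing more than "drop a final e, then count runs of vowel letters". The name crudeSyllables is hypothetical, and this is only an illustration of the idea, not the method from my poetry post. It will miscount many words.

```wolfram
(* Hypothetical fallback: estimate syllables from spelling alone. *)
(* Step 1: lowercase the word and drop a final, usually silent, e. *)
(* Step 2: count maximal runs of vowel letters.                    *)
(* Step 3: report at least 1, since every word has a syllable.     *)
crudeSyllables[w_String] := Max[1,
   StringCount[
    StringReplace[ToLowerCase[w], "e" ~~ EndOfString -> ""],
    ("a" | "e" | "i" | "o" | "u" | "y") ..]];

crudeSyllables /@ {"Lenore", "nation", "the"}
(* {2, 2, 1} *)
```

Counting runs rather than single letters is what keeps "io" in "nation" from being counted twice.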

I recommend that you check out my post from last week entitled "Computer Analysis of Poetry — Part 1: Metrical Pattern" for an extensive example.

I think that my approach can be improved upon even more by the use of machine learning, using WordData values for training data.

Hopefully this gave you enough information to decide which approach will work for you.

thank you

I now have a list like {nation, 3, future, 2}.

I just need to add up the counts, so I am looking in the documentation.
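For adding up the counts, a short sketch that assumes the list is flat, with the words as strings and the counts as integers. The variable names are only for illustration.

```wolfram
(* Flat list: words and counts alternate. Keep the integers and sum them. *)
flat = {"nation", 3, "future", 2};
totalFlat = Total[Cases[flat, _Integer]]
(* 5 *)

(* If the list is stored as {word, count} pairs instead, sum the second column. *)
pairs = {{"nation", 3}, {"future", 2}};
totalPairs = Total[pairs[[All, 2]]]
(* 5 *)
```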

There is some nice work on this subject in "Un Divisor Silábico (Spanish)" from the Wolfram Demonstrations Project, contributed by Jaime Rangel-Mondragon, which has the drawback of being written in Spanish. He has devised a number of clever rules for decomposing Spanish words into syllables, and they work extremely well. He doesn't go further, but I did some (unpublished) work applying his rules to longer texts and obtained some interesting (IMHO) statistics. For example, the 86,016 words that make up the Spanish-language dictionary used by Wolfram contain 3,707 different syllables (including repetitions due to accents, proper names, and some words of foreign origin). Unfortunately, there was no enthusiastic, or even mildly interested, response from specialists in the field, so I didn't pursue the matter further.

thank you

I worked on counting syllables in famous English speeches, for the staff in that department.

I realized that I need a ; at the end of many statements so that they do not produce unneeded output.

It looks like you are satisfied with the answers so far. I continued to work on the problem, though, to come up with the best possible (non-machine-learning) solution. The first line sets the list of words: replace RandomWord[10] with your own list, or just run it as is.

wds = RandomWord[10];
ipaVs = ipaVowels = {"aɪ", "aʊ", "eɪ", "ɔɪ", "oʊ", "ɐ", "ɑ", "ɒ", "ɔ", "ɘ", "ə", "ɛ", "ɜ", "ɝ", "ɞ", "ɤ", "ɨ", "ɪ","ɯ", "ɵ", "ɶ", "ʉ", "ʊ", "ʌ", "ʏ", "a", "æ", "e", "i", "o", "œ", "ø", "u", "y"};
dip = {"ai", "au", "ay", "ea", "ee", "ei", "eu", "ey", "ie", "io", "oa", "oe", "oi", "oo", "ou", "oy", "ua", "ue", "ui", "uy"};
vow = {"a", "e", "i", "o", "u", "y"};
hyp[wd_] := (h = WordData[wd, "Hyphenation"] /. Missing[_] -> {}; {h, Length[h]});
ipa[wd_] := (p = WordData[wd, "PhoneticForm"] /. Missing[_] -> "X"; {p, StringCount[p, ipaVs]});
(* Spelling-based fallback: score each matched letter group and total the scores. *)
reg[wd_] := {wd, 
   Total@StringCases[wd,
     {"e" ~~ EndOfString -> Nothing, "-" -> Nothing, "qu" -> 0, "eness" -> 1, "ement" -> 1,
     {"p", "b", "c", "d", "f", "g", "k", "s", "t", "z"} ~~ "le" ~~ EndOfString -> 1, 
     {"d", "t"} ~~ "ed" ~~ EndOfString -> 1, "ed" ~~ EndOfString -> 0, dip -> 1,
     vow -> 1, Except[vow] .. -> 0}]};
calcScr[cts_] := (
   If[cts[[1]] == cts[[2]] > 0, Return[{cts[[1]], "very high"}]];
   If[cts[[1]] == 1 && cts[[2]] == 2, Return[{2, "high"}]];
   If[cts[[1]] == 0 && cts[[2]] == cts[[3]], Return[{cts[[2]], "very high"}]];
   If[cts[[1]] > 0 && cts[[2]] != cts[[1]], Return[{cts[[1]], "high"}]];
   If[cts[[1]] == 0 && cts[[2]] == 0, Return[{cts[[3]], "medium"}]];
   Return[{cts[[2]], "high"}]);
cts = {#, hyp[#][[2]], ipa[#][[2]], reg[#][[2]]} & /@ ToLowerCase[wds];
scr = ({#[[1]], calcScr[#[[2 ;; 4]]]} &) /@ cts;
dts = Dataset[
  Association[#[[1]] -> <|"syllables" -> #[[2, 1]],  "confidence" -> #[[2, 2]]|> & /@ scr]]

Output is in the form of a dataset, but you can change that to your needs:

[Image: the resulting dataset, one row per word with "syllables" and "confidence" columns]

I have tested this on about 500 random words and only found 3 wrongly assessed words: nationalism, socialism, and nth. Though I think it is probably overkill for your question, it was a fun side project. Thanks. : )
