WOLFRAM COMMUNITY

Count syllables in speeches?

Jay Weininger, Santa Fe College

Posted 6 years ago

For a text-analysis project, I want to count the syllables in speeches. I saw the "Hyphenation" property in WordData, but the elementary task of applying it to my list of words is just out of my reach, as I have only just started to work with text analysis and the Wolfram Language. Thanks.

POSTED BY: Jay Weininger

5 Replies

Mark Greenberg

Posted 6 years ago

It looks like you are satisfied with the answers so far. I continued to work on the problem, though, to come up with the best possible (non-machine-learning) solution. Set wds in the first line to your own list of words, or just run the code as is.

wds = RandomWord[10];
ipaVs = ipaVowels = {"a?", "a?", "e?", "??", "o?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?","?", "?", "?", "?", "?", "?", "?", "a", "æ", "e", "i", "o", "", "ø", "u", "y"};
(* vowel digraphs, each counted as a single vowel sound *)
dip = {"ai", "au", "ay", "ea", "ee", "ei", "eu", "ey", "ie", "io", "oa", "oe", "oi", "oo", "ou", "oy", "ua", "ue", "ui", "uy"};
(* single vowel letters; "y" is treated as a vowel *)
vow = {"a", "e", "i", "o", "u", "y"};
(* hyp: hyphenation segments from WordData and their count (0 if unavailable) *)
hyp[wd_] := (h = WordData[wd, "Hyphenation"] /. Missing[_] -> {}; {h, Length[h]});
(* ipa: phonetic form from WordData and the number of IPA vowel sounds in it *)
ipa[wd_] := (p = WordData[wd, "PhoneticForm"] /. Missing[_] -> "X"; {p, StringCount[p, ipaVs]});
(* reg: fallback estimate from spelling rules; each matched piece contributes 0 or 1 *)
reg[wd_] := {wd, 
   Total@StringCases[wd,
     {"e" ~~ EndOfString -> Nothing, "-" -> Nothing, "qu" -> 0, "eness" -> 1, "ement" -> 1,
     {"p", "b", "c", "d", "f", "g", "k", "s", "t", "z"} ~~ "le" ~~ EndOfString -> 1, 
     {"d", "t"} ~~ "ed" ~~ EndOfString -> 1, "ed" ~~ EndOfString -> 0, dip -> 1,
     vow -> 1, Except[vow] .. -> 0}]};
(* calcScr: reconcile the {hyphenation, IPA, spelling} counts into {syllables, confidence} *)
calcScr[cts_] := (
   If[cts[[1]] == cts[[2]] > 0, Return[{cts[[1]], "very high"}]];
   If[cts[[1]] == 1 && cts[[2]] == 2, Return[{2, "high"}]];
   If[cts[[1]] == 0 && cts[[2]] == cts[[3]], Return[{cts[[2]], "very high"}]];
   If[cts[[1]] > 0 && cts[[2]] != cts[[1]], Return[{cts[[1]], "high"}]];
   If[cts[[1]] == 0 && cts[[2]] == 0, Return[{cts[[3]], "medium"}]];
   Return[{cts[[2]], "high"}]);
(* one row per word: {word, hyphenation count, IPA count, spelling count} *)
cts = {#, hyp[#][[2]], ipa[#][[2]], reg[#][[2]]} & /@ ToLowerCase[wds];
scr = ({#[[1]], calcScr[#[[2 ;; 4]]]} &) /@ cts;
dts = Dataset[
  Association[#[[1]] -> <|"syllables" -> #[[2, 1]],  "confidence" -> #[[2, 2]]|> & /@ scr]]

The output is a Dataset, but you can change that to suit your needs.

I have tested this on about 500 random words and found only three that were assessed incorrectly: nationalism, socialism, and nth. Though I think it is probably overkill for your question, it was a fun side project. Thanks. : )

POSTED BY: Mark Greenberg

Jay Weininger, Santa Fe College

Posted 6 years ago

Thank you. I worked on determining syllable counts in famous English speeches for the staff in that department. I realized I needed a ; at the end of many statements so as not to produce unneeded output.

POSTED BY: Jay Weininger

Tomas Garza, Retired, freelance

Posted 6 years ago

There is some nice work on this subject in "Un Divisor Silábico (Spanish)" from the Wolfram Demonstrations Project (http://demonstrations.wolfram.com/UnDivisorSilabicoSpanish/), contributed by Jaime Rangel-Mondragon, which has the drawback of being written in Spanish. He has devised a number of clever rules for the syllabic decomposition of Spanish words, and they work extremely well. He doesn't go further, but I did some (unpublished) work to apply his rules to longer texts and obtain interesting (IMHO) statistics. For example, the 86,016 words that make up the Spanish language dictionary used by Wolfram contain a total of 3,707 different syllables (including repetitions due to the presence of accents, proper names, and some words of foreign origin). Unfortunately, there was no hectic or even mildly enthusiastic response from specialists in the field, so I didn't pursue the matter further.

POSTED BY: Tomas Garza

Jay Weininger, Santa Fe College

Posted 6 years ago

Thank you. I now have a list of the form {nation, 3, future, 2}. I just need to add up the results, so I am looking in the documentation.

POSTED BY: Jay Weininger
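
For adding up the counts in a flat list like the one in the post above, here is a minimal sketch. It assumes the words are strings and the counts are integers; the variable name results is hypothetical.

```wolfram
(* hypothetical flat list alternating word and syllable count, as in the post above *)
results = {"nation", 3, "future", 2};

(* keep only the integer entries, then sum them *)
Total[Cases[results, _Integer]]
```

For this list the sum is 5. Cases picks out the numeric entries regardless of where they sit, so the same line should work if a word happens to be missing its count.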

Mark Greenberg

Posted 6 years ago

I found that WordData[word,"Hyphenation"] usually divides words into syllables correctly, with three caveats. The word must have hyphenation info in WordData, some short 2-syllable words are returned without a hyphenation division, and some words are too short to hyphenate.

WordData[#, "Hyphenation"] & /@ {"wished", "over", "of"}

{Missing["NotAvailable"], {"over"}, Missing["NotAvailable"]}

It is possible to overcome these issues. If you absolutely need syllable information for every word, then I recommend that you begin with WordData[word,"Hyphenation"]. Then retrieve WordData[word,"PhoneticForm"]. This gives the International Phonetic Alphabet version of the word, if it is available. The number of syllables will correspond to the number of vowel sounds in the word. In the case of words like "over", this will be more accurate.

ipaVowels = {"a?", "a?", "e?", "??", "o?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "a", "æ", "e", "i", "o", "", "ø", "u", "y"};
phon = WordData[#, "PhoneticForm"] & /@ {"wished", "over", "of"};
{#, StringCount[#, ipaVowels]} & /@ phon

{{"w???t", 1}, {"?o?v?", 2}, {"??v", 1}}

Words that are not in WordData or have neither hyphenation nor phonetic values can be dealt with in a crude counting of likely vowel sound letter combinations. In other words, the word "Lenore" likely contains two syllables because of the arrangement of vowels, final -e usually being silent.

I recommend that you check out my post from last week entitled "Computer Analysis of Poetry Part 1: Metrical Pattern" for an extensive example. I think that my approach can be improved upon even more by the use of machine learning, using WordData values for training data. Hopefully this gave you enough information to decide which approach will work for you.

POSTED BY: Mark Greenberg
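
The crude vowel counting described in the post above could be sketched roughly as follows. This is only an illustration under simple assumptions (a final "e" is silent, and each run of adjacent vowel letters is one syllable); crudeSyllables is a hypothetical helper name, not code from the post.

```wolfram
(* crudeSyllables is a hypothetical helper, not part of the original post *)
crudeSyllables[wd_String] := Module[{w},
  (* drop a final "e", which is usually silent *)
  w = StringReplace[ToLowerCase[wd], "e" ~~ EndOfString -> ""];
  (* count runs of consecutive vowel letters; report at least one syllable *)
  Max[1, Length[StringCases[w, ("a" | "e" | "i" | "o" | "u" | "y") ..]]]
]

crudeSyllables["Lenore"]
```

For "Lenore" this gives 2, matching the example in the post. It will miscount words such as "table", where the final "e" belongs to a syllable, so it is best kept as a last resort after the hyphenation and phonetic lookups.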
