Message Boards Message Boards

Using Prefix trees for Markov chain text generation

Posted 2 years ago

POSTED BY: Anton Antonov
4 Replies

Here is the corresponding paclet:

enter image description here

POSTED BY: Anton Antonov
Posted 2 years ago

Hi Anton, thanks for this considerable effort. It's something I've been thinking about and experimenting on lately as well. As far as validation goes, is it possible to give more rigorous assurances hitting not only the 1-gram frequency distribution but also whichever n-gram distribution is used for determining transition probabilities?

The context I'm thinking of is the Metropolis-Hastings algorithm. Obviously in this case reading text forwards is not the same as reading it backwards, so symmetric proof logic designed for "Gaussian drawn step size" does not readily apply. It seemed to me that, reading forwards only, the following naive answer would be okay:

$$A(x'|x) = P(x'|x) / g(x'|x)$$

with acceptance $A$, proposal $g$, and conditional $P$ (as on wikipedia). For optimization sake, we wouldn't want to calculate reverse-conditional $P(x|x')$ if we can get away without it. One potential issue is that, when rejecting some proposals, we can't end up with a sentence like "the cat cat cat jumped jumped over the the fence", but repeats may be necessary to achieve exact statistics.

I looked briefly on google scholar, and didn't find anything relevant. Maybe I don't have the right search terms?

POSTED BY: Brad Klee

First -- thank you for your comment!

  1. Prefix trees (tries) can be used to build both forward and backward phrases. For example, I used multiple 3-gram and 4-gram tries with frequencies in order to predict the most likely correct word in a contextual spellchecker.

  2. I am not sure do I interpret the validation question correctly -- ideally this answer is relevant:

    • My tries implementation allows getting sub-tries.
    • For example, a trie based on 4-grams:
    1. Can be queried with a 2-gram
    2. The phrases of the resulting sub-trie can be extracted together with their probabilities
    3. A top-k kind of test can be used to evaluate do these phrases overlap with a certain presumed correct set of phrases
POSTED BY: Anton Antonov

enter image description here -- you have earned Featured Contributor Badge enter image description here Your exceptional post has been selected for our editorial column Staff Picks http://wolfr.am/StaffPicks and Your Profile is now distinguished by a Featured Contributor Badge and is displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract