Group Abstract

Message Boards

8.8K Views

4 Replies

16 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Staff Picks Data Science Graphics and Visualization Packages Statistics and Probability Machine Learning Natural Language Processing Wolfram Function Repository

Posted 2 years ago

POSTED BY: Anton Antonov

4 Replies

Sort By:

Posted 2 years ago

Here is the corresponding paclet:

POSTED BY: Anton Antonov

Posted 2 years ago

Hi Anton, thanks for this considerable effort. It's something I've been thinking about and experimenting on lately as well. As far as validation goes, is it possible to give more rigorous assurances hitting not only the 1-gram frequency distribution but also whichever n-gram distribution is used for determining transition probabilities? The context I'm thinking of is the Metropolis-Hastings algorithm. Obviously in this case reading text forwards is not the same as reading it backwards, so symmetric proof logic designed for "Gaussian drawn step size" does not readily apply. It seemed to me that, reading forwards only, the following naive answer would be okay: $$A(x'\|x) = P(x'\|x) / g(x'\|x)$$ with acceptance $A$, proposal $g$, and conditional $P$ (as on wikipedia). For optimization sake, we wouldn't want to calculate reverse-conditional $P(x\|x')$ if we can get away without it. One potential issue is that, when rejecting some proposals, we can't end up with a sentence like "the cat cat cat jumped jumped over the the fence", but repeats may be necessary to achieve exact statistics. I looked briefly on google scholar, and didn't find anything relevant. Maybe I don't have the right search terms?

POSTED BY: Brad Klee

Posted 2 years ago

First -- thank you for your comment! Prefix trees (tries) can be used to build both forward and backward phrases. For example, I used multiple 3-gram and 4-gram tries with frequencies in order to predict the most likely correct word in a contextual spellchecker. (Using the package "MonadicPhraseCompletion.m".) I am not sure do I interpret the validation question correctly -- ideally this answer is relevant: My tries implementation allows getting sub-tries. For example, a trie based on 4-grams: Can be queried with a 2-gram The phrases of the resulting sub-trie can be extracted together with their probabilities A top-k kind of test can be used to evaluate do these phrases overlap with a certain presumed correct set of phrases

POSTED BY: Anton Antonov

Posted 2 years ago

-- you have earned *Featured Contributor Badge* Your exceptional post has been selected for our editorial column *Staff Picks* http://wolfr.am/StaffPicks and Your Profile is now distinguished by a *Featured Contributor Badge* and is displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback