How does one extract "Word Frequency History"?

Posted 10 years ago

Hi - Again a warning: Beginner.

I would like to accomplish two goals:

  1. Get a clear example of how to get the Word Frequency History for a particular list of words (the words come from a list) over a range of dates. The output would be the data. I do not want the pod from Wolfram|Alpha, just the data for analysis. I have gone through the "People and History Page", where one is directed to the WordData function.
  2. Is it possible to create a function that generates x words with a positive slope (according to frequency) and another with a negative slope? Basically a measure of relevancy.

Lastly, it was not particularly clear to me (after some research) where the word frequency data comes from. The definitions, I know, come from WordNet.

Thank you in advance.

POSTED BY: Itay Livni
2 Replies
Posted 10 years ago

Thanks Kyle - This is a great answer although I was naively hoping to keep everything in Mathematica :)

Word Frequencies in Written and Spoken English: Based on the British National Corpus. Pearson ESL, 2001

This is interesting because the Wolfram|Alpha data goes to 2007-ish and can be downloaded.

Alternatively, you can also download Google’s ngrams datasets to gain information about word frequency...

The ngram datasets are something I looked at earlier, however:

  1. I could not verify the data against Wolfram|Alpha's
  2. Some quick sanity tests did not pass (not for this forum)

I was very much led astray by a crumb in the People & History Reference Guide

Very helpful!

POSTED BY: Itay Livni

The first question that you asked turns out to be more complicated than I think you expected. It appears that there is no word frequency data available through Mathematica. Evaluate the following in Mathematica to see the list of properties that WordData gives access to: WordData["Properties"]. So there is no simple explanation that someone can give you; below I offer a suggestion about how I would go about accomplishing this task.

Here is how to figure out where the info comes from in Wolfram|Alpha. At the bottom left of a Wolfram|Alpha page, such as this one, there is a link called "sources". If you click "sources", you will see another link called "word data". If you click "word data", you will see the following citation: "Leech, G., P. Rayson, and A. Wilson. Word Frequencies in Written and Spoken English: Based on the British National Corpus. Pearson ESL, 2001." Following that citation brings you to this page, which contains the datasets that are presumably used to present the information in Wolfram|Alpha. You can download those datasets and import them into Mathematica using Import[], but it seems that they are not built into the Wolfram Language at this time.

Alternatively, you can also download Google's ngrams datasets to gain information about word frequency.
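As a rough sketch of the import step: the snippet below assumes a tab-separated layout of word, year, count (one record per line). That layout, the sample values, and the name `history` are illustrative assumptions, not the actual format of either dataset, so check the real files before relying on it.

```mathematica
(* Made-up sample in an ASSUMED layout: word <tab> year <tab> count.
   For a downloaded file, Import["path/to/file.tsv", "TSV"] would take
   the place of ImportString (the path is hypothetical). *)
raw = ImportString[
   "apple\t2000\t120\napple\t2001\t150\npear\t2000\t80\npear\t2001\t60",
   "TSV"];

(* Group the rows by word, keeping {year, count} pairs for each word *)
history = GroupBy[raw, First -> Rest];

(* Look up the series for one word *)
history["apple"]
```

Raw counts would normally be divided by the total number of words recorded for each year before comparing words, since the size of the corpus changes from year to year.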

Once you have one of these datasets imported into Mathematica, you can certainly write a function to search for words with increasing usage and words with decreasing usage. This is the more complex task of picking trends from a noisy dataset (moving average, linear regression, …), but Mathematica is great for this type of work. See the documentation on Statistical Data Analysis for a good starting point. You can also use functions such as DateListPlot[] and LinearModelFit[] to figure out whether a word frequency is trending up or down.
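A minimal sketch of the linear-regression idea, assuming the data has already been arranged as word -> list of {year, frequency} pairs. The words and numbers here are invented for illustration, and `trendSlope`, `rising`, and `falling` are hypothetical names, not built-in functions.

```mathematica
(* Invented data: word -> {{year, relative frequency}, ...} *)
history = <|
   "alpha" -> {{2000, 1.0}, {2001, 1.2}, {2002, 1.5}, {2003, 1.9}},
   "beta" -> {{2000, 3.0}, {2001, 2.6}, {2002, 2.1}, {2003, 1.8}}
   |>;

(* Slope of the least-squares line through the points;
   "BestFitParameters" returns {intercept, slope} for this model *)
trendSlope[series_] := Module[{x},
   LinearModelFit[series, x, x]["BestFitParameters"][[2]]];

slopes = trendSlope /@ history;

(* Words ordered from steepest rise and from steepest fall *)
rising = Keys[Reverse[Sort[Select[slopes, Positive]]]];
falling = Keys[Sort[Select[slopes, Negative]]];

(* Visual check of the underlying series *)
ListLinePlot[Values[history], PlotLegends -> Keys[history]]
```

Take[rising, UpTo[n]] would then give the n fastest-rising words. On noisy real data it may help to smooth each series first (for example with MovingAverage) or to check the fit's "ParameterTable" before trusting the sign of a small slope.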

Hope this helps :)

POSTED BY: Kyle Keane