Perform queries remotely to reduce data size?

Posted 1 year ago

Hello,

I'd like to produce a list of the most (un)common words in French. WordList provides the words, and WordFrequencyData returns the frequency of each word in a given list. But it is highly inefficient to download the full word list to my computer and then fetch every frequency, only to sort by frequency and throw away almost all the data except the top/bottom ten items.

Indeed, a naive approach times out every time:

words = WordList["CommonWords", Language -> "French"]
wordFreq = TakeLargest[WordFrequencyData[words, Language -> "French"], 10]
... WordFrequencyData::timeout: A network operation for WordFrequencyData timed out. Please try again later.

Dividing the queries into smaller partitions,

wordsPartitions = Partition[words, UpTo[100]];
Join @@ Map[WordFrequencyData[#, Language -> "French"] &, wordsPartitions]

takes many hours, with many failures along the way, so retry logic is also needed. Hence, back to the title: is there a better way? If this were a remote SQL database, I could select the words, order by frequency, and set a limit of 10, so very little data would have to travel over the API. (Btw, it's not even much data in the first place; it's incredibly slow for a commercial product.)
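For what it's worth, the retry logic could look roughly like the sketch below. The names fetchWithRetry, frenchFreq, and frequenciesInChunks are hypothetical helpers, not built-ins, and the sketch assumes that a failed call either issues a message or returns something other than an Association. Chunks that still fail after all attempts are silently skipped.

```wolfram
(* Hypothetical helper, not a built-in: evaluate fetch[chunk] up to maxTries
   times and return the first result that is an Association.
   If every attempt fails, return an empty Association instead. *)
fetchWithRetry[fetch_, chunk_List, maxTries_Integer : 3, pause_ : 2] :=
  Module[{result = $Failed, tries = 0},
    While[tries < maxTries && ! AssociationQ[result],
      tries++;
      (* Check turns any issued message (e.g. a timeout) into $Failed *)
      result = Quiet @ Check[fetch[chunk], $Failed];
      (* wait a little before the next attempt *)
      If[! AssociationQ[result] && tries < maxTries, Pause[pause]]
    ];
    If[AssociationQ[result], result, <||>]
  ];

(* Frequency lookup for one chunk of French words *)
frenchFreq[chunk_List] := WordFrequencyData[chunk, Language -> "French"];

(* Split the word list into chunks of at most `size` words (UpTo keeps the
   final, shorter chunk) and merge the per-chunk Associations into one *)
frequenciesInChunks[words_List, size_Integer : 100] :=
  Join @@ (fetchWithRetry[frenchFreq, #] & /@ Partition[words, UpTo[size]]);
```

Calling frequenciesInChunks[words] would then replace the Join/Map line above. It does not make the service any faster; it only keeps a single timeout from aborting the whole run.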

5 Replies
Posted 3 months ago

Hi Daniel,

Did you ever come up with a satisfactory solution to your problem? I am struggling with the same issue.

I have read Wolfram's introductory book on the language, worked through a lot of online tutorials, and made web searches, but I found myself completely stuck.

Please let me know.

Thanks. Scott

Posted 3 months ago

Hello Scott, No, I was not able to solve it. I added some retry logic to recover from errors and, as far as I can remember, it worked in the end to get the desired list of most un/common words, but I had to let it run overnight. Do let me know if you find out something new. Meanwhile, I have given up on Mathematica; I can't use it for what I bought it for. When the data is not missing, it is too slow.

Posted 3 months ago

Daniel,

Thanks for your reply. The Wolfram Language does some very clever things so easily that I was shocked it was so clumsy and slow with what I thought would be an elementary exercise.

I am not quite ready to give up on Wolfram yet. I have been tinkering around with logic to break the query down and recover from failures, but I haven't gotten there yet.

I see that someone has posted their solution, so I plan to dig into that.

Hopefully, I will "get" the style of this language and have some fun with it.

Thanks again. Scott

A while ago, I wrote a package that might help you with this problem: lazyLists https://github.com/ssmit1986/lazyLists

The idea of a lazy list is that it only generates elements on request. The full documentation is in the example notebook in the repo, but here's an example of how to deal with your particular problem:

Load the package and get all words:

Needs["lazyLists`"]
words = WordList["CommonWords", Language -> "French"];

Partition the words into chunks of length 100 (the Hold wrapper is used to indicate that the list is stored in the symbol words and works as a kind of pointer; it prevents the whole list from being copied over all the time):

partitionedWords = lazyPartition[Hold[words], 100];

Map WordFrequencyData over the list. The {fun, Listable} notation tells the package that the function can be applied to an entire chunk without needing to Map the function over the chunk. I convert the output of WordFrequencyData into a list with Normal, because the package doesn't work with Associations:

wordFreq = Map[
   {Normal @ WordFrequencyData[#, Language -> "French"] &, Listable},
   partitionedWords
  ];

Get the frequency data for the first 1000:

result = Take[wordFreq, 1000];
Most[result]

Most returns the 1000 elements that have been extracted. Last can be used to get the "tail" of the list and continue evaluating where you left off. E.g., get the next 10 results:

Most@Take[Last[result], 10]

You can also use Take[..., All] to go through the whole word list.
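Once the frequencies are collected, picking the most and least common words could be sketched as follows. This assumes the collected results form a flat list of word -> frequency rules (which is what Normal produces from an Association); I have not checked this against the exact output shape of the package. The values in the toy data are invented for illustration and are not real WordFrequencyData output.

```wolfram
(* Toy data with invented frequencies; stands in for the collected rules *)
rules = {"le" -> 0.03, "chat" -> 0.0001, "zzz" -> Missing["NotAvailable"],
   "de" -> 0.04, "maison" -> 0.0005};

(* Turn the rules into an Association and drop words without frequency data *)
freq = DeleteMissing[Association[rules]];

(* The n words with the highest and the lowest frequencies *)
mostCommon = TakeLargest[freq, 2]
leastCommon = TakeSmallest[freq, 2]
```

With real data, replace 2 by 10 to get the top and bottom ten.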

Hope this helps!

Posted 3 months ago

Hi Sjoerd,

Thanks for the lead. I will need to digest what you have provided, but it is incredibly helpful to have something that works and works efficiently.

Thanks. Scott
