Perform queries remotely to reduce data size?

Daniel Janzon

Posted 7 years ago

Hello, I'd like to produce a list of the most (un)common words in French. There is WordList, and for a list of words I can get the frequency of each word via WordFrequencyData. But it is highly inefficient to download the full list of words to my computer, then fetch their frequencies, only to sort them by frequency and throw away almost all the data except the top/bottom ten items. Indeed, a naive approach times out every time:

    words = WordList["CommonWords", Language -> "French"]
    wordFreq = Take[WordFrequencyData[words, Language -> "French"], 10]
    ...
    WordFrequencyData::timeout: A network operation for WordFrequencyData timed out. Please try again later.

Dividing the queries into smaller partitions,

    wordsPartitions = Partition[words, UpTo[100]];
    Join @@ Map[WordFrequencyData[#, Language -> "French"] &, wordsPartitions]

takes many hours, with many failures along the way, so retry logic also has to be added.

Hence, back to the title: is there a better way? If this were a remote SQL database, I would be able to select words, order by frequency and set the limit to 10. Not much data would have to travel over the API. (By the way, it's not even much data in the first place; it's incredibly slow for a commercial product.)
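
The "order by frequency, limit 10" step can at least be done locally once the frequencies have arrived. The following is a minimal sketch, assuming every chunked request succeeds. `chunkedFrequencies` and `frenchFrequencies` are hypothetical helper names, not built-in functions, and nothing here reduces the amount of data that travels over the network.

```wolfram
(* Hypothetical helper: query frequencies chunk by chunk and merge the results. *)
(* fetch is any function that maps a list of words to an Association word -> frequency. *)
chunkedFrequencies[words_List, fetch_, size_Integer : 100] :=
  Join @@ Map[fetch, Partition[words, UpTo[size]]]

(* The real lookup needs network access and can still time out. *)
frenchFrequencies[chunk_List] := WordFrequencyData[chunk, Language -> "French"]

words = WordList["CommonWords", Language -> "French"];
freqs = DeleteMissing[chunkedFrequencies[words, frenchFrequencies]];

(* Local equivalent of ORDER BY frequency DESC / ASC with LIMIT 10. *)
mostCommon = TakeLargest[freqs, 10]
leastCommon = TakeSmallest[freqs, 10]
```

`Partition[words, UpTo[size]]` keeps the final, shorter chunk, and `DeleteMissing` drops words for which no frequency is available before ranking. If a single request fails, `Join @@ ...` will not produce a clean Association, so this sketch only covers the case where no retries are needed.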

POSTED BY: Daniel Janzon

5 Replies

Sjoerd Smit, Wolfram Research Europe Ltd.

Posted 6 years ago

A while ago, I wrote a package that might help you with this problem: lazyLists

https://github.com/ssmit1986/lazyLists

The idea of a lazy list is that it only generates elements on request. The full documentation is in the example notebook in the repo, but here's an example of how to deal with your particular problem.

Load the package and get all words:

    Needs["lazyLists`"]
    words = WordList["CommonWords", Language -> "French"];

Partition the words into chunks of length 100 (the `Hold` wrapper is used to indicate that the list is stored in the symbol `words` and works as a kind of pointer; it prevents the whole list from being copied over all the time):

    partitionedWords = lazyPartition[Hold[words], 100];

Map `WordFrequencyData` over the list. The `{fun, Listable}` notation tells the package that the function can be applied to an entire chunk without needing to `Map` the function over the chunk. I convert the output of `WordFrequencyData` into a list with `Normal`, because the package doesn't work with `Associations`:

    wordFreq = Map[
      {Normal @ WordFrequencyData[#, Language -> "French"] &, Listable},
      partitionedWords
    ];

Get the frequency data for the first 1000:

    result = Take[wordFreq, 1000];
    Most[result]

`Most` returns the 1000 elements that have been extracted. `Last` can be used to get the "tail" of the list and continue evaluating where you left off. E.g., get the next 10 results:

    Most@Take[Last[result], 10]

You can also use `Take[..., All]` to go through the whole word list. Hope this helps!
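
The same "one chunk at a time" idea can also be sketched with built-in functions only, keeping just a running top-n instead of every result. This is not the lazyLists API. `runningTop` and `fakeFetch` are hypothetical names, and `fakeFetch` is an offline stand-in that returns made-up numbers so the sketch runs without network access.

```wolfram
(* Hypothetical helper: merge one more chunk into the best-so-far Association *)
(* and keep only the n entries with the largest values. *)
runningTop[fetch_, n_Integer][best_Association, chunk_List] :=
  TakeLargest[Join[best, DeleteMissing[fetch[chunk]]], UpTo[n]]

(* Offline stand-in for a frequency lookup; the values are NOT real frequencies. *)
fakeFetch[chunk_List] := AssociationMap[StringLength[#]/100. &, chunk]

chunks = Partition[{"a", "bb", "ccc", "dddd", "eeeee"}, UpTo[2]];

(* Fold feeds the chunks in one by one, starting from an empty Association. *)
Fold[runningTop[fakeFetch, 2], <||>, chunks]
```

For real data, `fakeFetch` would be replaced by something like `WordFrequencyData[#, Language -> "French"] &`, and `TakeSmallest` would give the least common words instead.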

POSTED BY: Sjoerd Smit

Scott Stiffler

Posted 6 years ago

Hi Sjoerd, Thanks for the lead. I will need to digest what you have provided, but it is incredibly helpful to have something that works and works efficiently. Thanks. Scott

POSTED BY: Scott Stiffler

Daniel Janzon

Posted 6 years ago

Hello Scott, No, I was not able to solve it. I added some retry logic to recover from errors, and as far as I can remember it worked in the end to get the desired list of most un/common words, but I had to let it run overnight. Do let me know if you find out something new. Meanwhile I have given up on Mathematica; I can't use it for what I bought it for. When the data is not missing, it is too slow.
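
The retry logic itself is not shown in the thread. A minimal sketch of what it might look like, assuming a failed request either issues a message (such as `WordFrequencyData::timeout`) or returns `$Failed`. `withRetry` and `flaky` are hypothetical names, and `flaky` is an offline stand-in that fails twice before succeeding.

```wolfram
(* Hypothetical helper: evaluate fetch[chunk], retrying up to maxTries times *)
(* when a message is issued or $Failed is returned, pausing wait seconds in between. *)
withRetry[fetch_, maxTries_Integer : 3, wait_ : 5][chunk_List] :=
  Module[{result = $Failed, tries = 0},
    While[result === $Failed && tries < maxTries,
      tries++;
      (* Check turns any message into $Failed; Quiet hides the message text. *)
      result = Quiet @ Check[fetch[chunk], $Failed];
      If[result === $Failed && tries < maxTries, Pause[wait]]
    ];
    result
  ]

(* Offline stand-in: returns $Failed on the first two calls, then succeeds. *)
calls = 0;
flaky[chunk_List] := If[++calls < 3, $Failed, AssociationMap[StringLength, chunk]]

withRetry[flaky, 5, 0][{"le", "chat"}]
```

With the real service, the call would look like `withRetry[WordFrequencyData[#, Language -> "French"] &][chunk]`. A chunk that still fails after `maxTries` attempts yields `$Failed`, so the caller has to decide whether to skip it or stop.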

POSTED BY: Daniel Janzon

Scott Stiffler

Posted 6 years ago

Daniel, Thanks for your reply. The Wolfram Language does some very clever things so easily that I was shocked it was so clumsy and slow with what I thought would be an elementary exercise. I am not quite ready to give up on Wolfram yet. I have been tinkering with logic to break the query down and recover from failures, but I haven't gotten there yet. I see that someone has posted their solution, so I plan to dig into that. Hopefully, I will "get" the style of this language and have some fun with it. Thanks again. Scott

POSTED BY: Scott Stiffler

Scott Stiffler

Posted 6 years ago

Hi Daniel, Did you ever come up with a satisfactory solution to your problem? I am struggling with the same issue. I have read Wolfram's introductory book on the language, worked through a lot of online tutorials, and made web searches, but I found myself completely stuck. Please let me know. Thanks. Scott

POSTED BY: Scott Stiffler
