Message Boards

Perform queries remotely to reduce data size?

Hello,

I'd like to produce a list of the most (un)common words in French. WordList gives me the words, and WordFrequencyData gives me the frequency of each word in a list. But it is highly inefficient to download the full word list to my computer and fetch the frequency of every word, only to sort by frequency and throw away almost all of the data except the top/bottom ten items.

Indeed, a naive approach times out every time:

words = WordList["CommonWords", Language -> "French"]
wordFreq = Take[WordFrequencyData[words, Language -> "French"], 10]
... WordFrequencyData::timeout: A network operation for WordFrequencyData timed out. Please try again later.

Dividing the queries into smaller partitions,

wordsPartitions = Partition[words, UpTo[100]]; (* UpTo keeps the final, shorter chunk *)
Join @@ Map[WordFrequencyData[#, Language -> "French"] &, wordsPartitions] (* merge the per-chunk associations *)

takes many hours, with many failures along the way, so retry logic would also have to be added. Hence, back to the title: is there a better way? If this were a remote SQL database, I could select the words, order by frequency, and set a limit of 10, so very little data would have to travel over the API. (Btw, it's not even much data in the first place; it's incredibly slow for a commercial product.)
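
For reference, a minimal sketch of what the retry logic around the partitioned queries might look like. It is untested against the real service; fetchWithRetry and the limit of three attempts are hypothetical choices, not anything built in, and it assumes a failed call returns something other than an association.

```wolfram
(* Sketch only: fetchWithRetry is a hypothetical helper, not a built-in. *)
fetchWithRetry[chunk_List, maxTries_Integer : 3] :=
 Module[{result = $Failed, tries = 0},
  (* Assumption: a failed or timed-out call does not return an association *)
  While[tries < maxTries && ! AssociationQ[result],
   tries++;
   result = Quiet[WordFrequencyData[chunk, Language -> "French"]]];
  (* Give up on this chunk with an empty association if every attempt failed *)
  If[AssociationQ[result], result, <||>]]

chunks = Partition[words, UpTo[100]];
freqs = Join @@ Map[fetchWithRetry, chunks];

(* Drop words without numeric data, then keep the ten most frequent *)
top10 = TakeLargest[Select[freqs, NumericQ], 10]
```

This still pulls every frequency over the network, so it only addresses the failures, not the underlying inefficiency.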

POSTED BY: Daniel Janzon
14 days ago
