WOLFRAM COMMUNITY

I want to know which word is repeated most often

Luis Ledesma

Posted 10 years ago

Hello all. After trying to solve the following problem for many hours, I feel compelled to interrupt your work and ask for help. I have a PDF file and I want to know which word is repeated most often, not counting conjunctions such as "and" and "but", but I still don't know how to do it. I hope someone can help me. I have attached the PDF file I am working with. Thanks in advance.

POSTED BY: Luis Ledesma

Luis Ledesma

Posted 10 years ago

Marco, many thanks for taking the trouble to help solve this problem. The approach you propose is new to me, since I know little about working with text in Mathematica. Thank you very much for your help. Regarding your closing question about the legality of posting the book, my purpose is purely educational, but the community moderators have the last word. Greetings.

POSTED BY: Luis Ledesma

Luis Ledesma

Posted 10 years ago

Bill Simpson, many thanks for answering so quickly; your tips were very useful. I found a solution along the following lines:

    imp = Import["C:\\Users\\bullito\\Documents\\7_5150.pdf", "Plaintext"];
    tokens = StringSplit[imp];

and finally

    Sort[Select[Tally[tokens], 100 <= #[[2]] <= 5000 &], #1[[2]] < #2[[2]] &]

This returns what you described, and I can now search these data for the words I am interested in. Thanks a lot.

David Gathercole, I think the approach you propose is similar but more detailed, which helps me understand more about working with text in Mathematica. Thank you for helping to answer my questions, and for explaining what each command does, which is very useful for understanding the result. Thanks again.
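For the last step, looking up particular words of interest, one possible sketch uses Counts, which returns an association, together with Lookup. The token list and the words queried here are placeholders invented for illustration; in the thread the tokens would come from StringSplit applied to the imported text.

```wolfram
(* Placeholder token list standing in for StringSplit[imp] *)
tokens = {"libro", "de", "libro", "casa", "de", "de"};

(* Counts returns an association: word -> number of occurrences *)
wordCounts = Counts[tokens];

(* Look up chosen words; a word that never occurs gets the default 0 *)
Lookup[wordCounts, {"libro", "casa", "mesa"}, 0]
(* {2, 1, 0} *)
```

Compared with scanning the sorted Tally list by eye, an association gives the count of any word directly, which may be more convenient when only a handful of words matter.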

POSTED BY: Luis Ledesma

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 10 years ago

Hi everyone,

if the text were English, the function DeleteStopwords would be quite handy. Luckily, it is easy to make your own stop word function. David has done all the hard work. (I am not contributing much, but had some time to spare...) So, like David, I import the text:

    text = Import["~/Desktop/7_5150.pdf", "Plaintext"];

With TextWords, I get a list of the words:

    wordlist = TextWords[text];

The problem is that I don't have a list of stop words, i.e. "useless" or content-free small words. Luckily there are hundreds of websites with lists of them. This command imports such a list from a website I found:

    stopwordsspanish = TextWords@Import["https://sites.google.com/site/kevinbouge/stopwords-lists/stopwords_es.txt?attredirects=0&d=1"];

Here is David's tally:

    Reverse@SortBy[Tally[Select[ToLowerCase[wordlist], ! StringMatchQ[#, stopwordsspanish] &]], #[[2]] &]

This generates the word cloud:

    WordCloud[Select[ToLowerCase[wordlist], ! StringMatchQ[#, stopwordsspanish] &]]

Here's a BarChart of the 20 most frequent words:

    BarChart[(Reverse@SortBy[Tally[Select[ToLowerCase[wordlist], ! StringMatchQ[#, stopwordsspanish] &]], #[[2]] &])[[1 ;; 20, 2]],
      ChartLabels -> Evaluate[Rotate[#, Pi/2] & /@ (Reverse@SortBy[Tally[Select[ToLowerCase[wordlist], ! StringMatchQ[#, stopwordsspanish] &]], #[[2]] &])[[1 ;; 20, 1]]]]

Looking at this, it appears to be the most inefficient way to program it, i.e. I do all the heavy string matching and tallying twice, but luckily the programmers at Wolfram took care of making the code fast so that I can be a bit lazy...

OK, this was a minimal contribution, but I hope it helps with the stop word problem.

Cheers,
Marco

PS: Is it actually legal to post the PDF of that book online?
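On the duplicated work mentioned above, a minimal sketch of one way to filter and tally only once and then reuse the result for the chart. The short stop word list and the sample sentence are placeholders invented for illustration, and MemberQ is swapped in for StringMatchQ as a plain exact-membership test; neither choice comes from the thread.

```wolfram
(* Placeholder stop word list; the thread imports a much longer one from a website *)
stopwords = {"de", "la", "el", "y", "en", "que"};

(* Placeholder text standing in for the imported PDF *)
text = "El ingenioso hidalgo de la Mancha y el escudero de la Mancha en la venta";

(* Filter and tally once, then sort by descending count *)
words = Select[ToLowerCase[TextWords[text]], ! MemberQ[stopwords, #] &];
counts = SortBy[Tally[words], -Last[#] &];

(* Reuse the same sorted tally for the labels and the bar heights *)
top = Take[counts, UpTo[20]];
BarChart[top[[All, 2]], ChartLabels -> (Rotate[#, Pi/2] & /@ top[[All, 1]])]
```

Storing the sorted tally in a variable means the filtering runs once however many plots or tables are derived from it.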

POSTED BY: Marco Thiel

David Gathercole, Vitol

Posted 10 years ago

A basic start is as follows. Import with "Plaintext" takes only the text data from the PDF. As Bill outlines, the quality of this extraction can vary a lot, but in this instance it seems to be sufficient. StringSplit splits the text at whitespace. Tally counts occurrences of the resulting words. Finally, TakeWhile shortens the list to words which occur at least 100 times. This whitespace division can leave punctuation or capitalisation attached to words, so a function that reduces strings to their lowercase letters alone could be used to further improve results.
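The code that accompanied this post is not preserved here. The following is only a sketch of how the steps described could fit together, not David's original code. The helper names lettersOnly and frequentWords are hypothetical, and the short sample sentence stands in for the imported PDF text.

```wolfram
(* Hypothetical helper: keep only the letters of a token, in lowercase *)
lettersOnly[token_String] :=
  ToLowerCase[StringJoin[StringCases[token, LetterCharacter]]];

(* Hypothetical helper: words occurring at least minCount times, most frequent first *)
frequentWords[text_String, minCount_Integer] := Module[{tokens, tally},
  (* Split at whitespace, clean each token, drop tokens that had no letters *)
  tokens = DeleteCases[lettersOnly /@ StringSplit[text], ""];
  (* Sort by descending count so that TakeWhile can stop at the first rare word *)
  tally = SortBy[Tally[tokens], -Last[#] &];
  TakeWhile[tally, Last[#] >= minCount &]
];

(* With the PDF from the thread this might be called as
   frequentWords[Import["~/Desktop/7_5150.pdf", "Plaintext"], 100] *)
frequentWords["El gato y el perro. El gato duerme; el perro no.", 2]
```

TakeWhile only gives the intended result because the tally is sorted by descending count first; on an unsorted tally, Select would be the safer choice.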

POSTED BY: David Gathercole

Bill Simpson

Posted 10 years ago

Hint: There are free online services that convert PDF files to txt files. Then you might contemplate whether or how you might use the Mathematica functions StringSplit and Tally. Then you might want to think of a way to easily extract the information you desire from those. Perhaps looking at the information hidden behind Details in the help page for Sort might be interesting. If I have made no mistake, then 'de' appears most often, 4409 times. I understand that is one of the words that you are not looking for. After all this you might consider whether 'de' and 'De' should be considered the same or not, whether a word followed by a '.' or ',' is being treated the same as that word without the trailing punctuation, whether that is resulting in incorrect counting, etc. But these hints should get you started.
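To make the capitalisation and punctuation issue concrete, here is a small sketch on a made-up sentence (not taken from the PDF). It compares a plain whitespace split with a split that lowercases the text first and treats every run of non-letter characters as a separator.

```wolfram
(* Toy sample text, invented for illustration *)
sample = "De la casa de la sierra. La casa, de noche.";

(* Naive approach: split at whitespace and count the raw tokens.
   "De" and "de" are counted separately, as are "casa" and "casa," *)
rawTally = Tally[StringSplit[sample]];

(* Normalised approach: lowercase, then split at any run of non-letter characters *)
cleanTokens = StringSplit[ToLowerCase[sample], Except[LetterCharacter] ..];
cleanTally = SortBy[Tally[cleanTokens], -Last[#] &];
```

In rawTally the word "de" is split across two entries ("De" once, "de" twice), while in cleanTally it appears once with a count of 3.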

POSTED BY: Bill Simpson
