Message Boards


I want to know which word is repeated most often

Posted 8 years ago
5 Replies
11 Total Likes

Hello to all. After attempting to solve the following problem for many hours, I am compelled to pause. The problem I am facing is this: I have a PDF file and I want to know which word is repeated most often in it, of course not counting conjunctions such as "and" and "but". I still don't know how to do it, and I hope that someone can help me. I have attached the PDF file I am working with. Thanks in advance.

POSTED BY: Luis Ledesma
Posted 8 years ago

Marco, many thanks for taking the trouble to help solve this problem. The approach you propose is novel, at least for me, since I know little about working with text in Mathematica. Thank you very much for your help. Regarding what you said at the end about the legality of the book: the use is merely educational, but the moderators of the community have the last word. Greetings.

POSTED BY: Luis Ledesma
Posted 8 years ago

Bill Simpson, many thanks for answering quickly. Your tips were of great utility to me. I thought I could find a solution as follows:

imp = Import["C:\\Users\\bullito\\Documents\\7_5150.pdf", "Plaintext"];
tokens = StringSplit[imp];

and finally

Sort[Select[Tally[tokens], 100 <= #[[2]] <= 5000 &], #1[[2]] < #2[[2]] &]

which returns what you described, and from that output I can already search for the words I am interested in. Thanks a lot.

David Gathercole, I think the approach you propose is similar, but it has many more details, which helps me understand more about how to work with text in Mathematica. Thank you for helping to answer my questions, and in addition, thank you for commenting on what each command does; that is very useful for understanding the result obtained. Thanks again.

POSTED BY: Luis Ledesma

Hi everyone,

If the text were English, the function DeleteStopwords would be quite handy. But luckily it is easy to make your own stop-word filter. David has done all the hard work (I am not contributing much, but had some time to spare...), so, like David, I import the text:

text = Import["~/Desktop/7_5150.pdf", "Plaintext"];

With TextWords, I get a list of the words:

wordlist = TextWords[text]; 

The problem is that I don't have stop words, i.e. "useless" or content-free small words. Luckily there are hundreds of websites with lists of them. This command imports a list of these words from a website I found:

stopwordsspanish = TextWords@Import[""];

Here is David's tally:

Reverse@SortBy[Tally[Select[ToLowerCase[wordlist], ! StringMatchQ[#, stopwordsspanish] &]], #[[2]] &]

The word cloud looks like this:

WordCloud[Select[ToLowerCase[wordlist], ! StringMatchQ[#, stopwordsspanish] &]]

(image: word cloud of the most frequent words)

Here's a BarChart of that:

BarChart[(Reverse@SortBy[Tally[Select[ToLowerCase[wordlist], !StringMatchQ[#, stopwordsspanish] &]], #[[2]] &])[[1 ;; 20, 2]], ChartLabels -> Evaluate[Rotate[#, Pi/2] & /@ (Reverse@SortBy[Tally[Select[ToLowerCase[wordlist], ! StringMatchQ[#,stopwordsspanish] &]], #[[2]] &])[[1 ;; 20, 1]]]]

(image: bar chart of the top 20 words)

Looking at this, it is probably the most inefficient way to program it; i.e. I do all the heavy string matching and tallying twice. But luckily the programmers at Wolfram have taken care of making the code fast, so that I can be a bit lazy...
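The repeated work mentioned above could be avoided by computing the filtered tally once and reusing it for both the ranking and the chart. A minimal sketch (variable names `filtered` and `sorted` are my own, not from the original post):

```mathematica
(* filter out the stop words and tally the remaining words once *)
filtered = Select[ToLowerCase[wordlist], ! StringMatchQ[#, stopwordsspanish] &];
sorted = Reverse@SortBy[Tally[filtered], Last];

(* reuse the stored tally for the bar chart of the top 20 words *)
BarChart[sorted[[1 ;; 20, 2]],
  ChartLabels -> (Rotate[#, Pi/2] & /@ sorted[[1 ;; 20, 1]])]
```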

Ok, this was a minimal contribution, but I hope it helps with the stop word problem.



PS: Is it actually legal to post the pdf of that book online?

POSTED BY: Marco Thiel

A basic start is as follows.

(attached image: word tally code)

Import with the "Plaintext" element takes only the text data from the PDF. As Bill outlines, this can be very variable, but in this instance it seems to be sufficient. StringSplit divides the text on whitespace, and Tally counts occurrences of the resulting words. Finally, TakeWhile shortens the list to words which occur at least 100 times.
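The attached screenshot did not survive extraction; a sketch of what the described pipeline might look like (the file path and the sort step are assumptions inferred from the description):

```mathematica
(* import the PDF as plain text; the path is a placeholder *)
text = Import["7_5150.pdf", "Plaintext"];

(* split on whitespace and count occurrences of each word *)
counts = Tally[StringSplit[text]];

(* sort by descending count, then keep words occurring at least 100 times *)
frequent = TakeWhile[Reverse@SortBy[counts, Last], #[[2]] >= 100 &]
```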

This whitespace division can leave punctuation or capitalisation attached to words, so a function that reduces strings to their lowercase letters alone could be used to improve the results further:

(attached image: word-reduction code)
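The screenshot of that function is likewise missing; a reduction function along the lines described might look like this (a guess at the screenshot's content, not the original code):

```mathematica
(* keep only the letters of a word and lowercase them,
   so that e.g. "De," and "de" tally as the same token *)
reduce[word_String] :=
  ToLowerCase[StringJoin[Select[Characters[word], LetterQ]]]

reduce["Hola,"]  (* -> "hola" *)
```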

POSTED BY: David Gathercole
Posted 8 years ago

Hint: There are free online services that convert pdf files to txt files.

Then you might contemplate whether or how you might use the Mathematica functions StringSplit and Tally.

Then you might want to think of a way to easily extract the information you desire from those. Perhaps looking at the information hidden behind Details in the help page for Sort might be interesting.

If I have made no mistake, then 'de' appears most often, 4409 times. I understand that is one of the words you are not looking for.

After all this, you might consider whether 'de' and 'De' should be treated as the same word, whether a word followed by a '.' or ',' is being counted separately from that word without the trailing punctuation, and whether that is resulting in incorrect counts, etc.
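Both points can be handled by normalizing the tokens before tallying. A sketch under the assumption that the PDF has already been converted to a text file (the filename and the punctuation set are illustrative, not from the original post):

```mathematica
(* read the converted text file and split on whitespace *)
raw = StringSplit[Import["7_5150.txt", "Text"]];

(* normalize: lowercase, then trim leading/trailing punctuation,
   so "De" / "de" and "de," / "de" count together *)
norm = StringTrim[ToLowerCase[#], ("," | "." | ";" | ":" | "!" | "?") ...] & /@ raw;

(* tally and sort by descending count to see the top words *)
Take[Sort[Tally[norm], #1[[2]] > #2[[2]] &], 5]
```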

But these hints should get you started.

POSTED BY: Bill Simpson