TXT or PDF data processing ?

Hi, I am a beginner in Wolfram Language. I am wondering whether it is possible to manipulate data of my own, for example a TXT or PDF file, or even the Bible, and to create my own set of attributes to manipulate?

POSTED BY: Jimmy Gunawan
4 Replies

Dear All,

Here is a little analysis of the main words in the Bible (because that was an example mentioned in the original post):

bibleTxt = Import["http://www.gutenberg.org/cache/epub/10/pg10.txt"];
WordCloud[DeleteStopwords[TextWords[bibleTxt]]]


Regarding the comment above about the problems with PDF files: I am working on a project where I need to analyse millions of PDF files scanned all over the world. Many of them are in fact files that Mathematica cannot open. I found that converting them on the command line to PS and then back to PDF usually works very well. I usually work on Linux-based machines, where this is no problem, but it also works on Windows if you use, for example, Cygwin. With that procedure the PDF problems virtually vanish.
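The round trip described above can also be driven from within Mathematica. The following is only a sketch under assumptions: the post does not say which command-line tools were used, so Ghostscript's pdf2ps and ps2pdf are assumed here and must be on the system PATH, and repairPDF is a hypothetical helper name.

```wolfram
(* Sketch: round-trip a PDF through PostScript to repair it.
   Assumes Ghostscript's pdf2ps and ps2pdf are installed and on the PATH.
   repairPDF is a hypothetical helper name, not a built-in function. *)
repairPDF[in_String, out_String] := Module[{ps, r1, r2},
  (* intermediate PostScript file in the temporary directory *)
  ps = FileNameJoin[{$TemporaryDirectory, FileBaseName[in] <> ".ps"}];
  r1 = RunProcess[{"pdf2ps", in, ps}];
  (* give up if the first conversion failed or the tool was not found *)
  If[r1["ExitCode"] =!= 0, Return[$Failed, Module]];
  r2 = RunProcess[{"ps2pdf", ps, out}];
  Quiet[DeleteFile[ps]];
  If[r2["ExitCode"] === 0, out, $Failed]
]
```

If the conversion succeeds, the repaired file can be imported as usual, for instance with Import[repairPDF["scan.pdf", "scan-fixed.pdf"], "Plaintext"], where both file names are placeholders.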

Once the PDFs are fixed there is no problem. For example, on this page the author makes a PDF file of a good book on ODEs available for personal use, with permission of the AMS. Essentially the same code as above works for the analysis of this PDF file:

odestxt = Import["http://www.mat.univie.ac.at/~gerald/ftp/book-ode/ode.pdf", "Plaintext"];
WordCloud[DeleteStopwords[TextWords[odestxt]]]


Of course you might want to do some additional preprocessing of the text files, but I think that this shows how extremely well Mathematica copes with different file formats.
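As one possible illustration of such preprocessing, the sketch below lower-cases the text, drops stopwords and very short tokens, and counts the remaining words. The sample string and the length threshold of 2 are assumptions chosen for illustration; in practice you would pass the imported text (for example odestxt) instead.

```wolfram
(* Sketch: simple text preprocessing before counting words.
   The sample text and the minimum word length are illustrative assumptions. *)
text = "The equation is linear. The solution of the linear equation is unique.";

(* lower-case first so that "The" and "the" are treated as the same word *)
words = DeleteStopwords[TextWords[ToLowerCase[text]]];

(* drop tokens of one or two characters, which are often extraction noise *)
words = Select[words, StringLength[#] > 2 &];

(* association of word -> count, restricted to the most frequent words *)
TakeLargest[Counts[words], UpTo[5]]
```

The resulting association can be passed directly to WordCloud, which accepts word counts as well as raw word lists.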

Cheers,

Marco

PS: I strongly recommend downloading the PDF of the ODE book for personal use. As I said, it is a good book.

PPS: Note that the full command to download the Bible, analyse the text, and make the word cloud easily fits into a tweet:

WordCloud[DeleteStopwords[TextWords[Import["http://www.gutenberg.org/cache/epub/10/pg10.txt"]]],IgnoreCase->True]

It has 112 characters or so, so you could tweet it to Wolfram's Tweet-a-Program section.

POSTED BY: Marco Thiel

Dear Marco, I was having a similar problem - and now you solved it! Thank you very much for your great posts!

Best regards -- Henrik

POSTED BY: Henrik Schachner

Please keep in mind the related discussion: Tools and tutorials for a Data Mining beginner?

POSTED BY: EDITORIAL BOARD

Hi Jimmy, I think the Wolfram Language is very strong at manipulating data, so I would say it is a safe bet to spend some money on a Home Edition. For manipulating PDFs (I presume you mean Adobe PDF), some caution is needed. The PDF import has not received much attention, and for me about 50% of the PDFs I would like to import don't work. I mainly use other tools to convert them to TXT, and then I'm fine (most of the time).

POSTED BY: l van Veen