WOLFRAM COMMUNITY

GROUPS: Data Science, Import and Export, Wolfram Language

TXT or PDF data processing?

Jimmy Gunawan, University of Technology of Sydney

Posted 10 years ago

Hi, I am a beginner in Wolfram Language. I am wondering if it is possible to manipulate data of my own, perhaps a PDF or even the Bible, and create my own set of attributes to manipulate?

POSTED BY: Jimmy Gunawan

4 Replies

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 10 years ago

Dear All,

here is a little analysis of the main words in the Bible (because that was an example mentioned in the original post):

    bibleTxt = Import["http://www.gutenberg.org/cache/epub/10/pg10.txt"];
    WordCloud[DeleteStopwords[TextWords[bibleTxt]]]

Regarding the comment above about the problems with PDF files: I am working on a project where I need to analyse millions of PDF files scanned all over the world. There are in fact many of those which Mathematica cannot open. I found that converting them on the command line to PS and then back to PDF usually works very well. I usually work on Linux-based machines, where this is no problem, but it also works on Windows if you use, for example, Cygwin. Using that procedure the PDF problems virtually vanish.

Once the PDFs are fixed there is no problem. For example, on this page the author makes, with permission of the AMS, a PDF file of a good book on ODEs available for personal use. Exactly the same code as above works for the analysis of this PDF file:

    odestxt = Import["http://www.mat.univie.ac.at/~gerald/ftp/book-ode/ode.pdf", "Plaintext"];
    WordCloud[DeleteStopwords[TextWords[odestxt]]]

Of course you might want to do some additional preprocessing of the text files, but I think that this shows how extremely well Mathematica copes with different file formats.

Cheers,
Marco

PS: I strongly recommend downloading the PDF of the ODE book for personal use. As I said, it is a good book.

PPS: Note that the full command to download the Bible, analyse the text and make the word cloud easily fits into a tweet:

    WordCloud[DeleteStopwords[TextWords[Import["http://www.gutenberg.org/cache/epub/10/pg10.txt"]]],IgnoreCase->True]

It has 112 characters or so, so you could tweet it to Wolfram's tweet-a-program section.

POSTED BY: Marco Thiel
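
The post above does not say which tools were used for the PDF to PS to PDF round trip. As a rough sketch only, one possibility is Ghostscript's `pdf2ps` and `ps2pdf` commands called from the Wolfram Language via `RunProcess`. The helper name `repairPDF` and the file names are hypothetical, and the code assumes both commands are installed and on the PATH.

```wolfram
(* Hypothetical helper: rewrite a PDF by converting it to PostScript and back.
   Assumption: Ghostscript's pdf2ps and ps2pdf are installed and on the PATH. *)
repairPDF[in_String, out_String] := Module[{ps, ok},
  (* intermediate PostScript file in the temporary directory *)
  ps = FileNameJoin[{$TemporaryDirectory, FileBaseName[in] <> ".ps"}];
  (* run both conversions; the second runs only if the first exits with 0 *)
  ok = RunProcess[{"pdf2ps", in, ps}, "ExitCode"] === 0 &&
       RunProcess[{"ps2pdf", ps, out}, "ExitCode"] === 0;
  (* clean up the intermediate file if it was created *)
  If[FileExistsQ[ps], DeleteFile[ps]];
  If[ok, out, $Failed]
]

(* Usage with hypothetical file names *)
fixed = repairPDF["scan.pdf", "scan-fixed.pdf"];
If[fixed =!= $Failed,
  WordCloud[DeleteStopwords[TextWords[Import[fixed, "Plaintext"]]]]
]
```

The helper returns `$Failed` rather than a file name when either conversion step fails, so a batch job over many files can skip the ones that still cannot be repaired.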

Henrik Schachner, Radiation Therapy Center, Weilheim, Germany

Posted 10 years ago

Dear Marco, I was having a similar problem - and now you solved it! Thank you very much for your great posts! Best regards -- Henrik

POSTED BY: Henrik Schachner

EDITORIAL BOARD, WOLFRAM

Posted 10 years ago

Please keep in mind the related discussion: "Tools and tutorials for a Data Mining beginner?"

POSTED BY: EDITORIAL BOARD

l van Veen, Hewlett-Packard Enterprise

Posted 10 years ago

Hi Jimmy, I think the Wolfram Language is very strong at manipulating data, so I would say it is a safe bet to spend some money on a Home Edition. For manipulating PDFs (I presume you mean Adobe PDF) some caution is needed. The PDF import has not had much attention, and for me about 50% of the PDFs I would like to import don't work. I mainly use other tools to convert them to TXT, and then I'm fine (most of the time).

POSTED BY: l van Veen
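
The "other tools" in the post above are not named. As an illustration only, the sketch below tries the built-in import first and falls back to an external converter when that fails. It assumes Poppler's `pdftotext` command is installed and on the PATH. The helper name `pdfText` and the file name are hypothetical.

```wolfram
(* Hypothetical helper: get the text of a PDF, preferring the built-in importer.
   Assumption: Poppler's pdftotext is on the PATH for the fallback path. *)
pdfText[file_String] := Module[{txt, out},
  (* suppress messages; a failed import does not return a string *)
  txt = Quiet[Import[file, "Plaintext"]];
  If[StringQ[txt],
    txt,
    (* fallback: convert to a temporary .txt file and import that instead *)
    out = FileNameJoin[{$TemporaryDirectory, FileBaseName[file] <> ".txt"}];
    If[RunProcess[{"pdftotext", file, out}, "ExitCode"] === 0,
      Import[out, "Text"],
      $Failed]
  ]
]

(* Usage with a hypothetical file name: ten most frequent non-stopwords *)
txt = pdfText["report.pdf"];
If[StringQ[txt],
  TakeLargest[Counts[ToLowerCase[DeleteStopwords[TextWords[txt]]]], 10]
]
```

Once the text is available as a string, the word counts are ordinary Wolfram Language data (an association of word to count) that can be filtered, sorted or plotted like any other data set.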
