Use TextRecognize for pdf images?

Friends: TextRecognize is a powerful function. I am wondering if it can be used to read image pdfs (pdf is a recognized import format). So one would have to import the file, and then read it with TextRecognize. Sounds simple...

I have made several tries, however, without success. Is this a wild goose chase or is there some way of doing it?

Thanks Francisco

9 Replies

Francisco, glad you find the workaround useful. We know Import is an essential gateway to the rest of our powerful language. As I said earlier, improving our PDF import is on our priority list, and hopefully this conversion will no longer be needed.

POSTED BY: Shadi Ashnai

You are right: the workaround was very useful, but ideally everything would be processed through Import. And to learn that solving this is a priority is really encouraging. Looking forward to that moment...

Francisco, for this attached PDF, our PDF import is failing! I converted the PDF to PNG and imported it to Mathematica. Then, TextRecognize does a fine job recognizing the text.

In[67]:= text = TextRecognize[images[[1]]];
Snippet[text, 10]

Out[68]= "\\/ ` II ~ VIOLENCIA (correspondencia renibida)

/..
Sugerencias varias.
Comité de Accién Ciudadana, Bogota. Propone constitucion de gr'upos
de estudlo para estudiar metodos apropiados contra la violencia.
Asociacion Juridica Colombianal Bogota. Resolucién en que convoca a
congreso de carécter nacional para iniciar accion solidaria contra la
violencia,
José J. Villafradez. Barbosa. Presidente Junta Accién Comunal de"

As a side note, we are aware of various PDF import bugs. We will add this file to our problematic sample files and will hopefully get to improving the support soon.

POSTED BY: Shadi Ashnai
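
For readers who want to reproduce the workaround above end to end, here is a minimal sketch. It assumes the PDF pages were already converted to PNG outside Mathematica; the post does not say which converter was used, so the `pdftoppm` command in the comment and the `page-*.png` file names are only illustrative assumptions.

```wolfram
(* Assumption: pages were rasterized outside Mathematica, for example with
     pdftoppm -png -r 300 scan.pdf page
   The tool, resolution, and file names here are hypothetical choices. *)
files = FileNames["page-*.png", Directory[]];  (* FileNames returns a sorted list *)

(* Import each PNG as an Image and run OCR on it *)
images = Import /@ files;
texts = TextRecognize /@ images;

(* Preview the first 10 lines of the first page, if any page was found *)
If[texts =!= {}, Snippet[First[texts], 10]]
```

This only sidesteps the PDF import step; the OCR quality still depends on the resolution chosen during conversion.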

Yes, it does work! Two lines of code, but ultra-useful. So much so that I think at some point it should make it into the documentation; manipulating pdfs --bad pdfs-- is a big issue for everyone who deals with textual data on an everyday basis. Many thanks, Shadi. This is the kind of thing that, if you do not know how to do it, can produce an enormous amount of frustration and lost time.

It depends heavily on the quality of the image. For clean, high-resolution images, TextRecognize should work just fine. With some of the new additions, you will also be able to get the location of each chunk of text to perform layout analysis, etc.

However, the quality of PDFs can vary a lot, and I'm sure there will be cases where TextRecognize is not capable of recognizing the text. Sometimes upsampling the image can help. For example, call

ImageResize[image, Scaled[2]]

If you give us a sample of your PDFs, we can send a more specific suggestion.

POSTED BY: Shadi Ashnai
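
As a rough illustration of the upsampling suggestion above, the sketch below combines it with OCR. Two things here are assumptions rather than anything stated in the post: the `Language -> "Spanish"` setting (chosen because the sample document in this thread is in Spanish) and the availability of the level/property form of TextRecognize for text locations, which depends on the version in use. `image` stands for one scanned page that has already been imported.

```wolfram
(* image: a single scanned page, assumed to be already imported as an Image *)
upsampled = ImageResize[image, Scaled[2]];

(* OCR on the enlarged page; the Language setting is an assumption about the document *)
text = TextRecognize[upsampled, Language -> "Spanish"];

(* Assumption: this version supports the level/property form of TextRecognize.
   If so, this gives the bounding box of each recognized line for layout analysis. *)
lineBoxes = TextRecognize[upsampled, "Line", "BoundingBox", Language -> "Spanish"];
```

Whether a factor of 2 is enough depends on the scan; it may be worth comparing the output for a couple of scale factors on one page before processing a whole batch.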

Many thanks for this. You are right: the way to discuss this is with a sample. I am attaching it. All the best, Francisco

Sorry, I did not attach it to the previous post.

Posted 5 years ago
pdf = First[Import["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"]];
TextRecognize[pdf]
(* Dummy PDF file *)

If the PDF just has text then this should work.

Import["file.pdf", "Plaintext"]
POSTED BY: Rohit Namjoshi
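
The two approaches in the post above can be combined into a single helper: try the text layer first, and fall back to OCR when the PDF turns out to be scanned images. This is only a sketch; `pdfText` is a hypothetical name, not a built-in, and it assumes that `Import` succeeds on the file and returns a list of pages, which, as noted elsewhere in this thread, is not guaranteed for every PDF.

```wolfram
(* pdfText is a hypothetical helper name, not a built-in function *)
pdfText[file_String] := Module[{plain, pages},
  (* First try the embedded text layer *)
  plain = Quiet[Import[file, "Plaintext"]];
  If[StringQ[plain] && StringLength[StringTrim[plain]] > 0,
   plain,
   (* No usable text layer: import the pages and OCR each one.
      Assumption: Import returns a list of pages, as in the post above. *)
   pages = Import[file];
   StringRiffle[TextRecognize /@ pages, "\n\n"]
  ]
 ]

(* Usage, with a placeholder file name *)
pdfText["file.pdf"]
```

If the import itself fails on a particular PDF, this helper will not help, and converting the pages to PNG outside Mathematica remains the fallback.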

Many thanks for this. However, it did not work. I think it is because I am dealing with PDFs that are images. In terms of processing difficulty I would describe them as "intermediate" (generally typed documents that are mainly text): neither terribly difficult nor especially easy. I am attaching an example. Best, Francisco
