WOLFRAM COMMUNITY

Use TextRecognize for pdf images?

Francisco Gutierrez, Universidad Nacional

Posted 6 years ago

Friends: TextRecognize is a powerful function. I am wondering whether it can be used to read image PDFs (PDF is a recognized import format). One would have to import the file and then read it with TextRecognize. Sounds simple... I have made several attempts, however, without success. Is this a wild goose chase, or is there some way of doing it? Thanks, Francisco

POSTED BY: Francisco Gutierrez

9 Replies

Shadi Ashnai, Wolfram Research, Inc.

Posted 6 years ago

Francisco, glad you found the workaround useful. We know `Import` is an essential gateway to the rest of our powerful language. As I said earlier, improving our PDF import is on our priority list, and hopefully this conversion will no longer be needed.

POSTED BY: Shadi Ashnai

Francisco Gutierrez, Universidad Nacional

Posted 6 years ago

You are right: the workaround was very useful, but ideally everything would be processed through Import. And to learn that solving this is a priority is really encouraging. Looking forward to that moment...

POSTED BY: Francisco Gutierrez

Shadi Ashnai, Wolfram Research, Inc.

Posted 6 years ago

Francisco, our PDF import is failing for this attached PDF! I converted the PDF to PNG and imported it into Mathematica. Then TextRecognize does a fine job recognizing the text.

In[67]:= text = TextRecognize[images[[1]]];
Snippet[text, 10]

Out[68]= "\\/ ` II ~ VIOLENCIA (correspondencia renibida)

/..
Sugerencias varias.
Comité de Accién Ciudadana, Bogota. Propone constitucion de gr'upos
de estudlo para estudiar metodos apropiados contra la violencia.
Asociacion Juridica Colombianal Bogota. Resolucién en que convoca a
congreso de carécter nacional para iniciar accion solidaria contra la
violencia,
José J. Villafradez. Barbosa. Presidente Junta Accién Comunal de"

As a side note, we are aware of various PDF bugs. We will add this file to our problematic sample files and will hopefully get to improving the support soon.

Attachments: AGN_Corresponden...png

POSTED BY: Shadi Ashnai
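
The workaround in the reply above (convert the PDF page to PNG outside Mathematica, import the PNG, then run TextRecognize) could be sketched as follows. This is a minimal sketch, not the poster's exact session: the file name `scan-page.png` is hypothetical, and the `Language -> "Spanish"` option is an addition that is not in the reply. It may help with accented characters, assuming the language data is available on your system.

```wolfram
(* Assumption: "scan-page.png" is a hypothetical name for a PDF page
   that was converted to PNG with an external tool. *)
file = "scan-page.png";

(* Import the raster image instead of the PDF itself. *)
img = Import[file];

(* Assumption: the Language option is not used in the thread; it is
   added here because the scanned document is in Spanish. *)
text = TextRecognize[img, Language -> "Spanish"];

(* Show the first 10 lines of the recognized text. *)
Snippet[text, 10]
```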

Francisco Gutierrez, Universidad Nacional

Posted 6 years ago

Yes it does work! Two lines of code but ultra-useful. So much that I think at some moment it should make it to the documentation; manipulating pdfs --bad pdfs-- is a big issue for all the people that deal on an everyday basis with textual data. Many thanks Shadi. This is the kind of thing that, if you do not know how to do it, can produce an enormous amount of frustration and loss of time.

POSTED BY: Francisco Gutierrez

Shadi Ashnai, Wolfram Research, Inc.

Posted 6 years ago

It depends heavily on the quality of the image. For clean, high-res images TextRecognize should work just fine. With some of the new additions, you would also be able to get the location of each chunk of text to perform layout analysis, etc. However, the quality of PDFs may vary a lot, and I'm sure there will be cases where TextRecognize is not capable of recognizing the text. Sometimes upsampling the image helps. For example, call `ImageResize[image, Scaled[2]]`. If you give us a sample of your PDFs, we can send a more specific suggestion.

POSTED BY: Shadi Ashnai
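
The upsampling suggestion above can be tried on a small test image. In this sketch the rasterized phrase is only a hypothetical stand-in for a low-resolution scanned page; with a real scan, replace `page` with the imported image. Whether upsampling improves recognition depends on the image, so no particular result is assumed here.

```wolfram
(* Hypothetical stand-in for a low-resolution scan: a short phrase
   rendered as a small raster image. *)
page = Rasterize[Style["Sugerencias varias", 10], "Image"];

(* Upsample by a factor of 2, as suggested in the reply. *)
upsampled = ImageResize[page, Scaled[2]];

(* Compare pixel dimensions before and after resizing. *)
{ImageDimensions[page], ImageDimensions[upsampled]}

(* Run OCR on the upsampled image. *)
TextRecognize[upsampled]
```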

Francisco Gutierrez, Universidad Nacional

Posted 6 years ago

Many thanks for this. You are right: the way to discuss this is with a sample. I am attaching it. All the best, Francisco

POSTED BY: Francisco Gutierrez

Francisco Gutierrez, Universidad Nacional

Posted 6 years ago

Sorry, I did not attach it to the previous post.

Attachments: AGN_Corresponden...pdf

POSTED BY: Francisco Gutierrez

Rohit Namjoshi

Posted 6 years ago

pdf = First[Import["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"]];
TextRecognize[pdf]
(* Dummy PDF file *)

If the PDF just has text then this should work.

Import["file.pdf", "Plaintext"]

POSTED BY: Rohit Namjoshi
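
For multi-page PDFs that `Import` can read, the same idea can be extended to every page. This is a minimal sketch under two assumptions: that `Import` returns a list of pages (as the `First[...]` in the reply above implies), and that rasterizing each page at 300 dpi gives TextRecognize a usable image. The resolution value is an illustrative choice, not a recommendation from the thread, and this will not help for files where the import itself fails.

```wolfram
(* Same public test file as in the reply above. *)
url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf";

(* Assumption: Import returns a list with one element per page. *)
pages = Import[url];

(* Turn each page into a raster image; 300 dpi is an assumed value. *)
images = Rasterize[#, "Image", ImageResolution -> 300] & /@ pages;

(* Recognize the text on every page. *)
TextRecognize /@ images
```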

Francisco Gutierrez, Universidad Nacional

Posted 6 years ago

Many thanks for this. However, it did not work. I think it is because I am dealing with PDFs that are images. In terms of difficulty to process, I would describe them as "intermediate" (generally typed documents which are mainly text): neither terribly difficult nor very easy. I am attaching an example. Best, Francisco

Attachments: AGN_Corresponden...pdf

POSTED BY: Francisco Gutierrez
