WOLFRAM COMMUNITY

GROUPS:

Image Processing Wolfram Language

Finding a simple corner in a simple PDF

Rodrigo Amor

Posted 1 year ago

Hello Wolfram community.

I have been using Mathematica for a while for different purposes, but recently I ran into an image processing problem that should be very simple, and I am somewhat lost.

In our office we have a bunch of old documents that we need to digitize. We have them as PDFs and the quality is quite good, so Mathematica does a good job with TextRecognize. However, depending on the type of page and the section of the page (text only, text with graphics, or text in tables), TextRecognize works better if I set RecognitionPrior to "Column" or to "Block".

So my problem is to build the bounding boxes for the document so that I can structure the data as a table later on. For example, the pink box is the title, the green box is the product description, and the yellow boxes are the product details (see the image or the attached PDF). My biggest challenge is the yellow boxes, as they change size and position. I want to be able to detect the red spots I marked so that I can draw the bounding box.

I figured out that almost all tables start with the same header pattern, so I am getting all pixels with that color and then checking where the color changes in x. Something like:

    headercolor = RGB[a, b, c];
    headercolorspositions = PixelValuePositions[thisimage, headercolor]
    DerivativeFilter[headercolorspositions]

I am sure there is an easier way to do this, like selecting the part of the image I want and then just looking for that pixel sequence inside ImageData. I feel like I am trying to kill a bug with a cannon. Any simple solution would be appreciated.
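For readers following the header-color idea, here is a minimal sketch of it on a synthetic image rather than the real scan. Everything specific in it is an assumption made for illustration: the color `headerColor`, the image size, the position of the bar, and the tolerance of 0.01. It collects the positions of all pixels close to the header color and takes their coordinate bounds, which gives two opposite corners of the header region.

```wolfram
(* Synthetic stand-in for a scanned page: white background with one colored header bar. *)
(* headerColor, the image size, and the bar position are made-up values for illustration. *)
headerColor = {0.2, 0.4, 0.8};
data = ConstantArray[{1., 1., 1.}, {200, 300}];
data[[20 ;; 40, 50 ;; 250]] = ConstantArray[headerColor, {21, 201}];
page = Image[data];

(* All pixel positions {x, y} whose value lies within 0.01 of the header color. *)
positions = PixelValuePositions[page, headerColor, 0.01];

(* CoordinateBounds gives {{xmin, xmax}, {ymin, ymax}}; *)
(* transposing turns that into two opposite corners {{xmin, ymin}, {xmax, ymax}}. *)
bounds = CoordinateBounds[positions];
corners = Transpose[bounds];

(* Mark the header region on the page and cut it out. *)
HighlightImage[page, Rectangle @@ corners]
header = ImageTrim[page, corners]
```

On a real scan the header color is rarely exact, so the tolerance would probably need to be larger, and a page with several tables would need the positions split into groups before taking bounds.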

2 Replies

Henrik Schachner, Radiation Therapy Center, Weilheim, Germany

Posted 1 year ago

Rodrigo, here is one way:

    img0 = First@Import["SamplePDF.pdf"]
    img = ColorNegate@Nest[ColorNegate@*DeleteSmallComponents, img0, 2];
    bbox = ComponentMeasurements[img, "BoundingBox"];
    HighlightImage[img0, bbox]

Isn't Mathematica just great?!

Regards -- Henrik

Attachments: SamplePDF.pdf
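As a possible follow-up to the bounding boxes above, the sketch below turns each box into a cropped image and runs TextRecognize on it. It is not part of Henrik's answer. It assumes that `SamplePDF.pdf` is the attachment from this thread, that the composition `ColorNegate@*DeleteSmallComponents` is what the snippet intended, and that a hypothetical area threshold `minArea` (an arbitrary value here) is enough to discard boxes too small to be tables. The `"Block"` prior is one of the two values mentioned in the question.

```wolfram
(* Same preprocessing as in the answer above; SamplePDF.pdf is the thread attachment. *)
img0 = First@Import["SamplePDF.pdf"];
img = ColorNegate@Nest[ColorNegate@*DeleteSmallComponents, img0, 2];
bbox = ComponentMeasurements[img, "BoundingBox"];

(* Each value has the form {{xmin, ymin}, {xmax, ymax}}. *)
boxes = Values[bbox];

(* Hypothetical filter: keep only boxes larger than minArea square pixels. *)
boxArea[{{x1_, y1_}, {x2_, y2_}}] := (x2 - x1) (y2 - y1);
minArea = 5000;
bigBoxes = Select[boxes, boxArea[#] > minArea &];

(* Draw the remaining boxes as rectangles on the original page. *)
HighlightImage[img0, Rectangle @@@ bigBoxes]

(* Crop each box and recognize its text separately. *)
texts = TextRecognize[ImageTrim[img0, #], RecognitionPrior -> "Block"] & /@ bigBoxes
```

Recognizing each crop separately makes it possible to choose a different RecognitionPrior per region, which is the behavior the question asks for. The threshold would need tuning per document.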

Theo Vine

Posted 1 year ago

I'm not sure whether this is too late or even worth your time, but it seems to me that you might be better off using Mathematica's built-in image selection tool. If you have your image in the processor, you can click on it and an array of options comes up. From there you can select the region you want, copy it, and then run TextRecognize on it to get the text, which you can then output wherever you want. It might take a little longer than having Mathematica do it automatically, but in my experience functions like ImageCorners are problematic when text is involved. In my opinion at least, it would probably take less time than marking the red spots and then coding something to create bounding boxes. This might not be what you're looking for, but I hope it helps!
