

Finding a simple corner in a simple PDF

Posted 22 days ago

Hello Wolfram community.

I have been using Mathematica for a while for different purposes, but recently I ran into a seemingly simple image-processing problem and I am a bit lost. In our office we have a bunch of old documents that we need to digitize. We have them as PDFs and the quality is quite good, so Mathematica does a good job with TextRecognize.

However, depending on the type of page and the section of the page (only text, text with graphics, or text in tables), TextRecognize works better when I set RecognitionPrior to "Column" or to "Block".
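
To make the comparison concrete, here is a minimal sketch of running both settings on the same page. `pageImage` is a hypothetical name for one rasterized page of the PDF, not something defined elsewhere in this post.

```wolfram
(* pageImage is a hypothetical name for one rasterized page of the PDF *)
asColumn = TextRecognize[pageImage, RecognitionPrior -> "Column"];
asBlock = TextRecognize[pageImage, RecognitionPrior -> "Block"];
(* Inspect both results side by side to see which prior suits this page *)
{asColumn, asBlock}
```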

So my problem is building the bounding boxes for the document so that I can structure the data as a table later on. For example, the pink box is the title, the green box is the product description, and the yellow boxes are the product details (see image or attached PDF). My biggest challenge is the yellow boxes, as they change size and position. I want to detect the red spots I marked so that I can draw the bounding box.

I figured out that almost all tables start with the same header pattern, so I am collecting all pixels with that color and then checking where the color changes along x. Something like:

headercolorspositions=PixelValuePositions[thisimage, headercolor]
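
From those positions, the corners could be read off roughly as follows. This is only a sketch under assumptions: it treats `headercolor` as a list of channel values, it assumes a single header on the page, and the tolerance 0.05 is a guess rather than a tuned value.

```wolfram
(* thisimage and headercolor are the names used above; 0.05 is an assumed tolerance *)
positions = PixelValuePositions[thisimage, headercolor, 0.05];
(* Smallest axis-aligned box enclosing all matching pixels *)
{{xmin, ymin}, {xmax, ymax}} = CoordinateBoundingBox[positions];
(* The four corners of the header region, as pixel positions counted from the bottom left *)
corners = {{xmin, ymin}, {xmax, ymin}, {xmax, ymax}, {xmin, ymax}};
HighlightImage[thisimage, corners]
```

With several tables on one page, the matching pixels would first have to be split into one group per header, which this sketch does not do.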

I am sure there is an easier way to do this, like selecting the part of the image I want and then just looking for that pixel sequence inside ImageData.

I feel like I am trying to kill a bug with a cannon. If you have any simple solution, it would be appreciated.

Desired output

POSTED BY: Rodrigo Amor


Here is one way:

img0 = First@Import["SamplePDF.pdf"]


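(* Strip small components (mostly text glyphs) in both polarities, then invert so the remaining dark structures become foreground *)
img = ColorNegate@Nest[ColorNegate@*DeleteSmallComponents, img0, 2];
(* Bounding box {{xmin, ymin}, {xmax, ymax}} of each remaining component *)
bbox = ComponentMeasurements[img, "BoundingBox"];
HighlightImage[img0, bbox]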
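
As a possible next step, each box could be cropped out and passed to TextRecognize. This is a sketch, not part of the tested code above: it assumes `img0` is an Image (if the import returns Graphics, it would need to be rasterized first), and the names `regions` and `texts` are introduced here only for illustration.

```wolfram
(* bbox is the list of label -> {{xmin, ymin}, {xmax, ymax}} rules computed above *)
regions = ImageTrim[img0, #] & /@ Values[bbox];
(* Run OCR on each cropped region separately *)
texts = TextRecognize /@ regions
```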


Isn't Mathematica just great?! Regards -- Henrik

POSTED BY: Henrik Schachner
Posted 19 days ago

I'm not sure whether this is too late or even worth your time, but it seems to me that you might be better off using Mathematica's built-in image selection tool. If you have your image in the notebook, you can click on it and an array of options comes up. From there you can select the region you want, copy it, and then run TextRecognize on it to get the text, which you can then output wherever you want. It might take a little longer than having Mathematica do it automatically, but in my experience functions like ImageCorners are problematic when text is involved.

It would probably take less time than marking the red spots and then coding something to create bounding boxes, in my opinion at least. This might not be what you're looking for, but hope it helps!
