Mathematica code to extract tabulated data after conversion from pdf to text

Posted 10 months ago
4 Replies

In the extract below, I've stripped it down to just looking for the entries under "Item Descriptions" or something similar and "Sub Total" or something similar. I'm using Classify with some training data because the code is to be run on thousands of invoices, which can contain slightly different variations of the same search strings. Attached are 3 example invoices I downloaded from the internet.

Posted 10 months ago

I have two suggestions. First, I'd change your classifier strategy. Rather than do a classifier for each heading that amounts to just true/false (effectively), I'd create a single classifier that classifies strings into specific types of headers with a sort of "discard" or "ignore" class. I'd also just coerce everything to lower case (or upper case or whatever normalization you want). I might also trim colons or other decorations. So, a sample of the classification data might look like this:

{"description" -> "Heading:ItemDescription", "product name" -> "Heading:ItemDescription", "nett total" -> "Heading:Subtotal", "sub-total" -> "Heading:Subtotal", "total remittance" -> "Ignore", "invoice total" -> "Ignore"}

Then, rather than performing multiple loops, each based on one classifier, you can perform the logic for each class found by the single classifier.
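To make the single-classifier idea concrete, here is a minimal sketch. The training rules are the six from the sample above; `normalizeLabel` and `classifyLabels` are hypothetical helper names, and a real classifier would need many more examples per class than this.

```wolfram
(* Training rules taken from the sample above; extend with your own variants. *)
headingData = {
   "description" -> "Heading:ItemDescription",
   "product name" -> "Heading:ItemDescription",
   "nett total" -> "Heading:Subtotal",
   "sub-total" -> "Heading:Subtotal",
   "total remittance" -> "Ignore",
   "invoice total" -> "Ignore"};

(* Hypothetical helper: lower-case the string and trim colons and
   whitespace from both ends. *)
normalizeLabel[s_String] :=
  StringTrim[ToLowerCase[s], (":" | WhitespaceCharacter) ..];

(* One classifier for every heading type, including the "Ignore" class. *)
headingClassifier = Classify[headingData];

(* Hypothetical helper: group candidate strings by predicted class,
   then drop whatever landed in "Ignore". *)
classifyLabels[labels : {___String}] :=
  KeyDrop[GroupBy[labels, headingClassifier[normalizeLabel[#]] &], "Ignore"];
```

Calling something like `classifyLabels[{"Description:", "Sub Total", "Invoice Total"}]` would give an association from class name to the strings assigned to it. With so little training data the predictions may well be unreliable, so treat this as a shape for the code rather than a working model.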

Second, to analyze both horizontally and vertically formatted data, I'd suggest that you use the extended form of TextRecognize to get properties for each match. Specifically, get the BoundingBox. This would be for the OCR logic. Something like

TextRecognize[...page image..., "Line", {"Text", "BoundingBox"}]

You'll get a bunch of entries like

{"ITEMS", Rectangle[{591, 1899}, {685, 1944}]}

Now, if you find your target headings lined up horizontally, then you process the invoice items as rows. If you find your target headings lined up vertically, then you process the invoice items as columns. If there are cases where things are arranged some other way, then you'll need to special-case those, I guess. You'll probably need to allow for some fuzziness in the alignment of the rectangle, but basically you test if the centers of the rectangles form a mostly horizontal line or a mostly vertical line.
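The alignment test could be sketched roughly as follows. It assumes the bounding boxes come back as `Rectangle[{xmin, ymin}, {xmax, ymax}]`, as in the entry shown above; `boxCenter`, `headingLayout`, and the default tolerance of 20 pixels are hypothetical and would need tuning against real scans.

```wolfram
(* Center of a bounding box given as Rectangle[{xmin, ymin}, {xmax, ymax}]. *)
boxCenter[Rectangle[min_, max_]] := Mean[{min, max}];

(* Hypothetical helper: decide how the target headings are laid out.
   tol is an assumed pixel tolerance for "mostly" aligned. *)
headingLayout[boxes : {__Rectangle}, tol_ : 20] :=
  Module[{centers = boxCenter /@ boxes, xSpread, ySpread},
   (* Spread of the centers along x and along y. *)
   {xSpread, ySpread} = (Max[#] - Min[#]) & /@ Transpose[centers];
   Which[
    (* Centers share roughly the same y: headings form a horizontal line. *)
    ySpread <= tol && xSpread > tol, "Rows",
    (* Centers share roughly the same x: headings form a vertical line. *)
    xSpread <= tol && ySpread > tol, "Columns",
    (* Anything else needs special-casing. *)
    True, "Other"]];
```

For example, pairing the "ITEMS" box above with a made-up second heading box at about the same height, `headingLayout[{Rectangle[{591, 1899}, {685, 1944}], Rectangle[{900, 1901}, {1010, 1946}]}]` returns `"Rows"`, so the invoice items would be processed as rows.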

POSTED BY: Eric Rimbey
Posted 10 months ago

There is a lot of extraneous code in your question. Could you pare this down to just the essentials? And I'm not convinced that you need Classify for this, because you seem to just need to look for specific labels. Also, you seem to be using an external tool to convert PDF to text, but Mathematica can do that for you, so is there some reason why you need that tool?
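For reference, a minimal sketch of doing the PDF-to-text step inside Mathematica, assuming the PDFs have a text layer (scanned invoices would need OCR instead). `pdfLines` is a hypothetical helper name.

```wolfram
(* Hypothetical helper: import the text layer of a PDF and split it into lines. *)
pdfLines[file_String] := StringSplit[Import[file, "Plaintext"], "\n"];
```

The resulting list of lines could then be searched for the target labels directly.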

I don't think you're going to get a definitive answer to this. I think this is one of those things where you need to keep adding test cases (and the code to handle them) as you encounter new invoice formats. But frankly, that seems easier than what you're trying to do now (if I even understand it).

POSTED BY: Eric Rimbey
Posted 10 months ago

Can you provide a couple sample PDFs?

POSTED BY: Eric Rimbey