GROUPS:

Computer Science Data Science Image Processing Import and Export Wolfram Language Machine Learning Natural Language Processing

Mathematica code to extract tabulated data after conversion from pdf to text

Archie Watts-Farmer, Carnot Ltd

Posted 10 months ago

POSTED BY: Archie Watts-Farmer

4 Replies

Archie Watts-Farmer, Carnot Ltd

Posted 10 months ago

In the extract below, I've stripped the code down to just looking for the entries under "Item Descriptions" (or something similar) and "Sub Total" (or something similar). I'm using `Classify` with some training data because the code will be run on thousands of invoices, which can contain slightly different variations of the same search strings. Attached are 3 example invoices I downloaded from the internet.

Attachments: invoice-1.pdf invoice-2.pdf invoice-3.pdf

POSTED BY: Archie Watts-Farmer

Eric Rimbey

Posted 10 months ago

I have two suggestions.

First, I'd change your classifier strategy. Rather than build a classifier for each heading that (effectively) amounts to just true/false, I'd create a single classifier that sorts strings into specific types of headers, with a sort of "discard" or "ignore" class. I'd also just coerce everything to lower case (or upper case, or whatever normalization you want). I might also trim colons or other decorations. So, a sample of the classification data might look like this:

`{"description" -> "Heading:ItemDescription", "product name" -> "Heading:ItemDescription", "nett total" -> "Heading:Subtotal", "sub-total" -> "Heading:Subtotal", "total remittance" -> "Ignore", "invoice total" -> "Ignore"}`

Then, rather than performing multiple loops, each based on one classifier, you can perform the logic for each class found by the one classifier.

Second, to analyze both horizontally and vertically formatted data, I'd suggest that you use the extended form of `TextRecognize` to get properties for each match. Specifically, get the `BoundingBox`. This would be for the OCR logic. Something like

`TextRecognize[...page image..., "Line", {"Text", "BoundingBox"}]`

You'll get a bunch of entries like

`{"ITEMS", Rectangle[{591, 1899}, {685, 1944}]}`

Now, if you find your target headings lined up horizontally, then you process the invoice items as rows. If you find your target headings lined up vertically, then you process the invoice items as columns. If there are cases where things are arranged some other way, then you'll need to special-case those, I guess. You'll probably need to allow for some fuzziness in the alignment of the rectangles, but basically you test whether the centers of the rectangles form a mostly horizontal line or a mostly vertical line.
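As a rough illustration of both suggestions, here is a minimal sketch. The helper names `normalizeLabel`, `rectCenter`, and `headingOrientation`, the 20-pixel tolerance, and the second rectangle in the usage line are all assumptions made for the example, not anything from the original code. A real classifier would also need far more training data than the six sample rules.

```wolfram
(* Normalize a candidate label: lower case, then strip whitespace and colons
   from both ends. normalizeLabel is a hypothetical helper name. *)
normalizeLabel[s_String] :=
  StringTrim[ToLowerCase[s], (WhitespaceCharacter | ":") ..]

(* One classifier for every heading type, with an "Ignore" class.
   These are only the sample rules from the post; real data would be larger. *)
trainingData = {
   "description" -> "Heading:ItemDescription",
   "product name" -> "Heading:ItemDescription",
   "nett total" -> "Heading:Subtotal",
   "sub-total" -> "Heading:Subtotal",
   "total remittance" -> "Ignore",
   "invoice total" -> "Ignore"};
headingClassifier = Classify[trainingData];

(* Center of a bounding box as returned by TextRecognize. *)
rectCenter[Rectangle[min_, max_]] := Mean[{min, max}]

(* Decide whether two or more heading boxes line up in a row or a column.
   tolerance (in pixels) is an assumed fuzziness allowance. *)
headingOrientation[rects : {_Rectangle, __Rectangle}, tolerance_ : 20] :=
  Module[{centers = rectCenter /@ rects, xSpread, ySpread},
   {xSpread, ySpread} = (Max[#] - Min[#]) & /@ Transpose[centers];
   Which[
    ySpread <= tolerance, "Horizontal", (* centers share a row: items are rows *)
    xSpread <= tolerance, "Vertical",   (* centers share a column: items are columns *)
    True, "Other"]]                     (* anything else needs special-casing *)

(* Usage with one box from the post and one made-up neighbor. *)
headingOrientation[{Rectangle[{591, 1899}, {685, 1944}],
  Rectangle[{900, 1901}, {1010, 1946}]}]
```

In practice the rectangles would come from the `TextRecognize` output, keeping only the lines for which `headingClassifier[normalizeLabel[text]]` returns something other than `"Ignore"`.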

POSTED BY: Eric Rimbey

Eric Rimbey

Posted 10 months ago

There is a lot of extraneous code in your question. Could you pare this down to just the essentials? And I'm not convinced that you need `Classify` for this, because you seem to just need to look for specific labels. Also, you seem to be using an external tool to convert PDF to text, but Mathematica can do that for you, so is there some reason why you need that tool? I don't think you're going to get a definitive answer to this. I think this is one of those things where you need to keep adding test cases (and the code to handle them) as you encounter new invoice formats. But frankly, that seems easier than what you're trying to do now (if I even understand it).
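A minimal sketch of that simpler approach, assuming the labels of interest can be listed explicitly. The sample `text` string and the `subtotalLabels` list are made up for illustration; with a real file, `text` would come from something like `Import["invoice-1.pdf", "Plaintext"]`, where the path is an assumption.

```wolfram
(* Stand-in for: text = Import["invoice-1.pdf", "Plaintext"]
   The string below is invented sample data, not taken from the attachments. *)
text = "Description Qty Price\nWidget 2 10.00\nSub Total: 20.00\nInvoice Total 24.00";

(* Known label variants; extend this list as new invoice formats turn up. *)
subtotalLabels = "sub total" | "sub-total" | "subtotal" | "nett total";

(* Keep only the lines that contain one of the labels, ignoring case. *)
subtotalLines = Select[
   StringSplit[text, "\n"],
   StringContainsQ[#, subtotalLabels, IgnoreCase -> True] &];

(* Pull the numeric strings out of each matching line. *)
subtotalValues = StringCases[#, NumberString] & /@ subtotalLines
```

Each new invoice format then becomes one more alternative in the label pattern (or one more test case), which is the incremental process described above.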

POSTED BY: Eric Rimbey

Eric Rimbey

Posted 10 months ago

Can you provide a couple sample PDFs?

POSTED BY: Eric Rimbey
