String matching and extraction from pdf files

Posted 1 year ago

I have many PDF files containing date records which can be imported and result in a list like the one shown in the example Notebook attached. I am interested in extracting a list of the first year fields: {2019, 2023}.

All the StringExpressions I have built extract the second year. I have tried qualifying the matching pattern with Shortest and other modifications, but I always get the list {"2020", "2023"}.

Can someone build a StringExpression that extracts {2019, 2023} or suggest an alternate manner of extracting the first year field? Note the exact position of the year is not known because cities may be of varying word length. The year needs to be extracted based on the text and date contents.

Some general guidance or references on how to extract exactly the fields one wants from a text expression would also be helpful. My general approach is to take an example text string I wish to parse and replace each field with a corresponding matching StringExpression, which leads to hit-and-miss results.

Example Notebook.

POSTED BY: Chuck DeCarlucci
3 Replies

Not as robust as Eric's answer

StringCases[txt, Repeated[DigitCharacter, {4}]][[All, 1]]
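
A minimal variation on the line above, assuming txt holds the same two strings as reportData in Eric's answer (the attached notebook is not reproduced here, so that is an assumption). It stops at the first four-digit run in each string and converts the result to integers. It still relies on the first year being the first run of four digits in the string.

```wolfram
(* Assumption: txt matches the sample strings from Eric's answer *)
txt = {"Los Angeles REPORT November 30, 2019 - December 31, 2020",
   "Chicago REPORT February 01, 2023 - February 28, 2023"};

(* The third argument 1 keeps only the first match per string;
   Flatten removes the inner lists, FromDigits turns "2019" into 2019 *)
firstYears =
 FromDigits /@ Flatten[StringCases[txt, Repeated[DigitCharacter, {4}], 1]]
```

Note that a string with no four-digit run contributes nothing to the flattened list, so the output can be shorter than the input.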
POSTED BY: Rohit Namjoshi

Very nice. Thank you. I overlooked TextCases as it is experimental and I have a difficult time picking the correct Mathematica function for the job. Mathematica has a lot of similar functions that do slightly different things and allow varying arguments. I tried using Find for text parsing but Find does not seem to allow StringExpressions.

POSTED BY: Chuck DeCarlucci
Posted 1 year ago

There is a nifty function called TextCases. One way to use it is to find text that can be interpreted as an entity type.

reportData = {"Los Angeles REPORT November 30, 2019 - December 31, 2020", "Chicago REPORT February 01, 2023 - February 28, 2023"};
TextCases[reportData, "Date"]

{{"November 30, 2019", "December 31, 2020"}, {"February 01, 2023", "February 28, 2023"}}

You can extract a "property" for the found cases:

TextCases[reportData, "Date" -> "Interpretation"]

The result will now be date objects (DateObject expressions) rather than strings. Since you only want the first instance for each string, you can use the optional third argument:

startDates = TextCases[reportData, "Date" -> "Interpretation", 1]

Finally, you can use DateValue to extract the year:

DateValue[#, "Year"] & /@ startDates

{{2019}, {2023}}

You can bundle this up into a function:

ExtractFirstDate[string_String] :=
 With[
  {first = TextCases[string, "Date" -> "Interpretation", 1]},
  If[{} == first, Missing["No dates found"], 
   DateValue[first[[1]], "Year"]]]

...and apply it:

ExtractFirstDate /@ reportData

{2019, 2023} (Note, these are integers.)

If you don't want to use TextCases or Date-specific functions, you could just use string pattern matching:

ExtractFirstDate2[string_String] :=
 With[
  {first = 
    StringCases[string, 
     WordCharacter .. ~~ Whitespace ~~ DigitCharacter .. ~~ "," ~~ 
      Whitespace ~~ DigitCharacter .., 1]},
  If[{} == first, Missing["No dates found"], 
   StringTrim@Last@StringSplit[first[[1]], ","]]]

...and

ExtractFirstDate2 /@ reportData

{"2019", "2023"} (Note, these are strings.)
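
On the more general question of extracting exactly one field: one option is to give the wanted part of the pattern a name and use a rule to return only that part, so no splitting is needed afterwards. A minimal sketch, assuming the same "Month DD, YYYY" layout as above; ExtractFirstYear is a hypothetical name used only for illustration.

```wolfram
(* ExtractFirstYear is a hypothetical helper, not part of the answers above *)
ExtractFirstYear[string_String] :=
 With[
  {first =
    StringCases[string,
     WordCharacter .. ~~ Whitespace ~~ DigitCharacter .. ~~ "," ~~
       Whitespace ~~ (year : Repeated[DigitCharacter, {4}]) :>
      FromDigits[year], 1]},
  (* Only the named part, year, is returned, already converted to an integer *)
  If[first === {}, Missing["No dates found"], First[first]]]

sample = {"Los Angeles REPORT November 30, 2019 - December 31, 2020",
   "Chicago REPORT February 01, 2023 - February 28, 2023"};

ExtractFirstYear /@ sample
```

For the two sample strings this should give {2019, 2023} as integers. The rest of the pattern serves as context that pins down which year is matched, while the name year marks the only piece that is returned.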

POSTED BY: Eric Rimbey