WOLFRAM COMMUNITY

String matching and extraction from pdf files

Chuck DeCarlucci, Retired

Posted 2 years ago

I have many PDF files containing date records. Importing them produces a list like the one shown in the attached example notebook. I want to extract a list of the first year fields: {2019, 2023}. Every StringExpression I have built extracts the second year instead. I have tried qualifying the matching pattern with Shortest and other modifications, but I always get the list {"2020", "2023"}.

Can someone build a StringExpression that extracts {2019, 2023}, or suggest an alternative way of extracting the first year field? Note that the exact position of the year is not known, because city names vary in word count. The year needs to be extracted based on the text and date contents.

Some general guidance or references on how to extract exactly the fields one wants from a text expression would also be helpful. My general approach is to take an example text string I wish to parse and replace each field with a corresponding matching StringExpression, which leads to hit-or-miss results. Example Notebook.

POSTED BY: Chuck DeCarlucci

3 Replies

Rohit Namjoshi

Posted 2 years ago

Not as robust as Eric's answer:

StringCases[txt, Repeated[DigitCharacter, {4}]][[All, 1]]
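To make the one-liner concrete, here is a minimal sketch that applies it step by step. It assumes txt holds the same two strings as reportData in Eric's answer; the names allYears and firstYears are only illustrative.

```wolfram
(* Assumption: txt holds the same strings as reportData in Eric's answer *)
txt = {"Los Angeles REPORT November 30, 2019 - December 31, 2020",
   "Chicago REPORT February 01, 2023 - February 28, 2023"};

(* All runs of exactly four digits, one sublist per input string *)
allYears = StringCases[txt, Repeated[DigitCharacter, {4}]]
(* {{"2019", "2020"}, {"2023", "2023"}} *)

(* Keep only the first match from each sublist *)
firstYears = allYears[[All, 1]]
(* {"2019", "2023"} *)

(* Optional: convert the strings to integers *)
FromDigits /@ firstYears
(* {2019, 2023} *)
```

One reason this is less robust: any other four-digit run ahead of the date (a street number, for example) would be picked up first, and [[All, 1]] fails with a Part error if some string contains no four-digit run at all.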

POSTED BY: Rohit Namjoshi

Chuck DeCarlucci, Retired

Posted 2 years ago

Very nice. Thank you. I overlooked TextCases because it is experimental, and I have a difficult time picking the correct Mathematica function for the job. Mathematica has a lot of similar functions that do slightly different things and accept varying arguments. I tried using Find for text parsing, but Find does not seem to allow StringExpressions.

POSTED BY: Chuck DeCarlucci

Eric Rimbey

Posted 2 years ago

There is a nifty function called TextCases. One way to use it is to find text that can be interpreted as an entity type.

reportData = {"Los Angeles REPORT November 30, 2019 - December 31, 2020", "Chicago REPORT February 01, 2023 - February 28, 2023"};
TextCases[reportData, "Date"]

{{"November 30, 2019", "December 31, 2020"}, {"February 01, 2023", "February 28, 2023"}}

You can extract a "property" for the found cases:

TextCases[reportData, "Date" -> "Interpretation"]

The result now will be data entities. Since you only want the first instance for each string, you can use the optional third argument:

startDates = TextCases[reportData, "Date" -> "Interpretation", 1]

Finally, you can use DateValue to extract the year:

DateValue[#, "Year"] & /@ startDates

{{2019}, {2023}}

You can bundle this up into a function:

ExtractFirstDate[string_String] :=
 With[
  {first = TextCases[string, "Date" -> "Interpretation", 1]},
  If[{} == first, Missing["No dates found"], 
   DateValue[first[[1]], "Year"]]]

...and apply it:

ExtractFirstDate /@ reportData

{2019, 2023} (Note, these are integers.)

If you don't want to use TextCases or Date-specific functions, you could just use string pattern matching:

ExtractFirstDate2[string_String] :=
 With[
  {first = 
    StringCases[string, 
     WordCharacter .. ~~ Whitespace ~~ DigitCharacter .. ~~ "," ~~ 
      Whitespace ~~ DigitCharacter .., 1]},
  If[{} == first, Missing["No dates found"], 
   StringTrim@Last@StringSplit[first[[1]], ","]]]

...and

ExtractFirstDate2 /@ reportData

{"2019", "2023"} (Note, these are strings.)
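As a variation on the pattern-matching version, here is a sketch that names the year part of the pattern so StringCases returns it directly, with no need to split and trim afterwards. ExtractFirstYear3 is a hypothetical name, and requiring exactly four digits for the year is an added assumption that the original pattern does not make.

```wolfram
reportData = {"Los Angeles REPORT November 30, 2019 - December 31, 2020",
   "Chicago REPORT February 01, 2023 - February 28, 2023"};

(* Hypothetical helper: same date shape as ExtractFirstDate2, but the year
   is captured by the named pattern `year` and converted to an integer *)
ExtractFirstYear3[string_String] :=
 With[
  {first =
    StringCases[string,
     WordCharacter .. ~~ Whitespace ~~ DigitCharacter .. ~~ "," ~~
       Whitespace ~~ (year : Repeated[DigitCharacter, {4}]) :>
      FromDigits[year], 1]},
  If[{} == first, Missing["No dates found"], First[first]]]

ExtractFirstYear3 /@ reportData
(* {2019, 2023} *)
```

Because the rule's right-hand side is evaluated only for the captured substring, the result is an integer rather than a string.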

POSTED BY: Eric Rimbey
