[✓] Parse a text with a template?

Let's say I have a text that contains various pieces of interesting information laid out irregularly, but whose structure is known in advance. Is there an easy way to parse the text and return the useful information in an Association?

For example, the following short text would be parsed with the following template (I'm anticipating the answer that comes next):

text=
"First Name: Tom (Age: 20)
Other info1:
Other info2:
Last Name: TomTom "; 

template=
"First Name: %FirstName% (Age: %Age%)
___
Last Name: %LastName% ";
POSTED BY: Faysal Aberkane
Answer
1 month ago

The code below answers the question. Here is a usage example:

parser = GetTextParser[template];
parser[text]
(*Returns <|"FirstName"->"Tom","Age"->20,"LastName"->"TomTom"|> *)

I was thinking it would be practical if Mathematica had something similar built in. It is the inverse of StringTemplate. What I have written turns out to be a basic parser for semi-structured text. Some resources for other languages can be found here: http://sonalake.com/latest/utah-open-source-semi-structured-text-parser/
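To make the "inverse of StringTemplate" remark concrete, here is a minimal sketch of the forward direction. It uses StringTemplate's own backtick slot syntax rather than the %Name% placeholders of the template above, and the values are simply the ones from the example text.

```mathematica
(* Forward direction: fill named slots from an Association. *)
filled = StringTemplate["First Name: `FirstName` (Age: `Age`)"][
  <|"FirstName" -> "Tom", "Age" -> 20|>
]
(* "First Name: Tom (Age: 20)" *)
```

The parser goes the other way: from the filled-in string back to the Association.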

This code started as a simpler prototype, but I managed to generalize it so that different placeholder patterns can be used in a template (here %VariableName% and ___, which stands for BlankNullSequence).

The code parses the template into intermediate variables, creates the variable names and symbols used to generate Mathematica string patterns, and finally defines how to convert the matched strings into results.

templateVariables=
    {
       "%"~~x:(WordCharacter..)~~"%":>variable[x]
       ,
       "___"->variable["Blank"]
    };

variableName[variable[key_]]:=key;
leftTemplatePattern[variable[key_],variableSymbols_]:=
    With[{symbol = variableSymbols[key]},
       Shortest[Pattern[symbol,BlankSequence[]]]
    ];
rightTemplatePattern[variable[key_],variableSymbols_]:= 
    With[{symbol = variableSymbols[key]},
       ConvertString@symbol &
    ];

variableName[variable["Blank"]]:=Nothing;
leftTemplatePattern[variable["Blank"],variableSymbols_]:=___;
rightTemplatePattern[variable["Blank"],variableSymbols_]:=Nothing;

GetTextParser[template_]:=
    Module[
       {stringSplited,variableNames,variableSymbols,leftPatterns,rightPatterns,
       stringPattern,variables},

       stringSplited=StringSplit[template,templateVariables];

       variables=Cases[stringSplited,_variable];

       variableNames=variableName/@variables;
       variableSymbols=AssociationThread[variableNames,Symbol/@variableNames];

       leftPatterns=stringSplited/.x_variable:>leftTemplatePattern[x,variableSymbols]//Apply@StringExpression;
       rightPatterns=rightTemplatePattern[#,variableSymbols]&/@variables;
       stringPattern=leftPatterns:>Evaluate@rightPatterns;

       Function[text,
         StringCases[text,stringPattern]//
         If[Length@#==1, First@#, #]&//
         Map[If[Head@# === Function, #[], #]&]//
         If[# =!= {}, AssociationThread[variableNames,#], Association[]]&
       ]
    ];

SetAttributes[ConvertString, Listable];
ConvertString[s_String]:=
    Module[
       {simported}, 

       If[StringCount[s, "/"] > 0 || StringCount[s, "\\"] > 0, 
         s
         , 
         simported=ToExpression[s];

         Which[
          NumericQ[simported]||simported===True||simported===False, 
              simported
          , 
          ListQ[simported], 
              Map[ConvertString[ToString[#]]&, simported]
          , 
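          (* DownValues is HoldAll, so the local variable must be evaluated first *)
          MatchQ[simported, _Symbol] && DownValues[Evaluate[simported]] =!= {}, 
              simported, 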
          True,  (*string case*)
              s
         ]
       ]
    ];
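As a rough illustration of what the code above builds, the sketch below writes out two of its stages by hand for the example template. The head `marker`, the variable `sample`, and the pattern names `first`, `age` and `last` are hypothetical names used only here. The values stay strings because the ConvertString step is left out.

```mathematica
(* Stage 1: splitting the template turns each placeholder into a marker. *)
pieces = StringSplit[
  "First Name: %FirstName% (Age: %Age%)",
  "%" ~~ name : (WordCharacter ..) ~~ "%" :> marker[name]
]
(* {"First Name: ", marker["FirstName"], " (Age: ", marker["Age"], ")"} *)

(* Stage 2: the kind of StringCases rule that results for the full template. *)
sample = "First Name: Tom (Age: 20)\nOther info1:\nOther info2:\nLast Name: TomTom ";

parsed = StringCases[sample,
  "First Name: " ~~ Shortest[first__] ~~ " (Age: " ~~ Shortest[age__] ~~ ")\n" ~~
    ___ ~~ "\nLast Name: " ~~ Shortest[last__] ~~ " " :>
   <|"FirstName" -> first, "Age" -> age, "LastName" -> last|>
]
(* {<|"FirstName" -> "Tom", "Age" -> "20", "LastName" -> "TomTom"|>} *)
```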
POSTED BY: Faysal Aberkane
Answer
1 month ago

Nice idea. Here's another approach, hastily written, so don't be too harsh! :) Will be back later:

templateToStringCases[template_] := StringCases[
  reapToStringCases @@ Reap[
    StringExpression @@ StringSplit[
      template,
      { "___" :> BlankNullSequence[]
      , "%" ~~ name : Except["%"] .. ~~ "%" :> nameToPattern[name]
      }
    ]
  ]
]

reapToStringCases[pattern_, {{resultRules__Hold}}] :=
  {pattern, Join[resultRules]} /. {patt_, Hold[result__]} :> (patt :> <|result|>)

nameToPattern[name_] := ToExpression[
  name,
  StandardForm,
  Function[
    symbol,
    Sow[Hold[name -> symbol]]; Pattern[symbol, BlankSequence[]],
    HoldFirst
  ]
];

and now

templateToStringCases[template] @ text

{<|"FirstName" -> "Tom", "Age" -> "20", "LastName" -> "TomTom"|>}
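This approach relies on two built-in mechanisms, sketched below in isolation. Reap collects everything sown during an evaluation, which explains the `{{resultRules__Hold}}` argument shape, and the three-argument form of ToExpression wraps the parsed input in a given head before evaluation. The strings and the symbol `x` are placeholders chosen for illustration.

```mathematica
(* Reap returns {result, {list of sown values}}. *)
reaped = Reap[Sow["a"]; Sow["b"]; "result"]
(* {"result", {{"a", "b"}}} *)

(* The third argument is applied to the parsed expression before it evaluates. *)
held = ToExpression["x", StandardForm, Hold]
(* Hold[x] *)
```

Note that, unlike the first answer, this version returns every value as a string ("20" rather than 20).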

POSTED BY: Kuba Podkalicki
Answer
1 month ago

A bit more complex than my version, but there are some interesting ideas in yours. I'll take some time to understand it.

POSTED BY: Faysal Aberkane
Answer
1 month ago

Let me know if there are any doubts or problems. I've updated the syntax to be more readable and added the HoldFirst attribute I had forgotten earlier.

POSTED BY: Kuba Podkalicki
Answer
1 month ago
