Message Boards Message Boards

0
|
7402 Views
|
18 Replies
|
3 Total Likes
View groups...
Share
Share this post:

[?] Find a substring between brackets?

Posted 7 years ago

Hello all

I would like to extract a substring between two brackets:

string1 = "abcde(fgh{ij}k)lmn"
pos1 = First[
  StringPosition[string1, "e(" ~~ RegularExpression["[^)]*"] ~~ ")"]]
StringTake[string1, pos1 + {2, -1}]

Result:

fgh{ij}k

This works pretty well so far. But if there other round brackets between this outer brackets, it doesnt find the correct closing bracket this way:

string2 = "abcde(fgh(ij)k)lmn"
pos2 = First[
  StringPosition[string2, "e(" ~~ RegularExpression["[^)]*"] ~~ ")"]]
StringTake[string1, pos2 + {2, -1}]

Result:

fgh(ij

What is a good way to solve this problem?

Thanky ou for any help and kind regards!

POSTED BY: Juerg Baertsch
18 Replies
Anonymous User
Anonymous User
Posted 7 years ago

This works pretty well so far. But if there other round brackets between this outer brackets, it doesnt find the correct closing bracket this way:

Yes you have to count opening and closing if you need to walk through the string using one loop. You may not have to, you likely have other options.

The speedy simple way for you to do is this: scan the string with a SHORT loop (the inside of the loop has the least number of operations) to find the first (. scan the loop again to find the last. that's all! in ASM language you'd search for the last using reverse direction cpu instruction. an alternate way is to regex the first, regex the last, then analyze the intersection of the matches (worst way?). If "moderately fast" is ok, you can use Position[] or Cases[] and simply take the first and last positions found of each of {), which i think the first responder to the post was trying to suggest.

With a parser you could match () as un-ambiguously parsed expressions (underline un-ambiguous). You have to study parsers to understand what that means: if you don't - you'll end up with code that makes mistakes you handn't imagined were possible. You can consider Position and Cases as un-ambiguous parsers but that their ability to match patterns is limited (they are specialized for matching parsed expressions - which if you learn to use them is good enough for much general parsing work).

POSTED BY: Anonymous User
Posted 7 years ago

Hi John

if you'd written a parser before you'd know what your asking isn't possible: meaning - yes there is no magic for bracket counting :)

On a homepage I found the following: enter image description here

Its not for Mathematica and I dont know if Mathematica can do this, but this was the thing I was looking for first.

But anyway, Neil showed in his posts above a solution, that exactly does what I wanted.

POSTED BY: Juerg Baertsch

Juerg,

First, there is a bug in my code -- I needed to add an "UpTo[]" in the stringTake to handle edge conditions:

findOption4[text_String, option_String] := 
 Module[{possible, opencnt, closecnt, cutoff, 
   subtext = 
    StringDrop[text, StringPosition[text, option ~~ "{"][[1, 2]]]}, 
  cutoff = 100; 
  possible = 
   StringCases[StringTake[subtext, UpTo[cutoff]], 
    StartOfString ~~ x___ ~~ "}" -> x, Overlaps -> All]; 
  opencnt = StringCount[possible, "{"] - StringCount[possible, "\{"];
  closecnt = 
   StringCount[possible, "}"] - StringCount[possible, "\}"];
  option <> "{" <> 
   possible[[Position[opencnt - closecnt, 0][[-1, 1]]]] <> "}"]

The timing test is not representative because the data is always the same and the result is only 4 characters. To generate a random test data set we can do something like this (and check to delete any lines that are too long for the functions)

randTexts = Table["dummy{" <>  StringJoin[Table["{" <> RandomWord[] <> "}", RandomInteger[{1, 10}]]] <>  "}", 1000];
nrandTexts = DeleteCases[randTexts, x_ /; StringLength[x] > 105];
longtexts = Map[
   FromCharacterCode[
      RandomInteger[{97, 122}, RandomInteger[{20, 200}]]] <> # <> 
     FromCharacterCode[
      RandomInteger[{97, 122}, RandomInteger[{20, 200}]]] &, 
   nrandTexts];

As an aside, I tried this with RandomWord[Language->"German"] but only 65% of the strings were short enough to use-- German uses many characters!

Generating the random words takes a long time because it uses the internet but now we can time things

In[100]:= Timing[findOption5[#, "dummy"] & /@ longtexts;]

Out[100]= {0.186421, Null}
In[101]:= Timing[findOption4[#, "dummy"] & /@ longtexts;]

Out[101]= {0.053151, Null}

Certainly you have more control processing strings character by character. At first glance most people would expect your code to be faster because it handles far fewer characters than my code does. The takeaway is that MMA is very inefficient at looping. If you could compile your function (MMA does not compile functions with string arguments) I would expect it to be much faster. (as an exercise you might want to turn your strings into character codes and process the strings as integers and compile the function -- I expect that would be VERY fast.)

This was an interesting example.

Regards,

Neil

POSTED BY: Neil Singer
Posted 7 years ago

Yes, this was really interessting for me and I could learn lot of new things, thank you Neil!

POSTED BY: Juerg Baertsch

Juerg,

Generally, it is better to avoid looping-- there is almost always a list-based approach that is faster. I was not aware that the text strings would be very large. I assumed that since the options are repeating over and over that you would be breaking up the file somehow before running this code snippet. In the case that you are looking for only one option in a large file, the looping code may be a better way.

However, If you can safely assume that there will never be more than 5 or 10 or 15 close brackets in a single option (which seems safe -- or alternatively limit the number of characters in an option), you could modify my list-based routine by cutting the search off at a maximum of, say in this example, 10 brackets. By doing this you would get a significant speedup over the looping code.

Here is a version that only looks 10-deep and runs about 4 times faster than the looping version (I read in your file as text):

findOption3[text_String, option_String] := 
 Module[{possible, opencnt, closecnt, cutoff, 
   subtext = StringDrop[text, StringPosition[text, option ~~ "{"][[1, 2]]]}, 
  cutoff = Last[ StringPosition[subtext, "}", 10]][[1]]; 
  possible = StringCases[StringTake[subtext, cutoff], StartOfString ~~ x___ ~~ "}" -> x, Overlaps -> All]; 
  opencnt = StringCount[possible, "{"] - StringCount[possible, "\{"];
  closecnt =  StringCount[possible, "}"] - StringCount[possible, "\}"];
  option <> "{" <>  possible[[Position[opencnt - closecnt, 0][[-1, 1]]]] <> "}"]

In[30]:= Timing[findOption[text, "\\def\\opa"];]

Out[30]= {0.0056, Null}

In[31]:= Timing[findOption3[text, "\\def\\opa"];]

Out[31]= {0.00156, Null}

Here is a version that allows only 100 characters in an option -- it runs even faster

findOption4[text_String, option_String] := 
 Module[{possible, opencnt, closecnt, cutoff, 
   subtext =  StringDrop[text, StringPosition[text, option ~~ "{"][[1, 2]]]}, 
  cutoff = 100; 
  possible = StringCases[StringTake[subtext, cutoff], StartOfString ~~ x___ ~~ "}" -> x, Overlaps -> All]; 
  opencnt = StringCount[possible, "{"] - StringCount[possible, "\{"];
  closecnt = StringCount[possible, "}"] - StringCount[possible, "\}"];
  option <> "{" <> possible[[Position[opencnt - closecnt, 0][[-1, 1]]]] <> "}"]

In[38]:= Timing[findOption4[text, "\\def\\opa"];]

Out[38]= {0.001267, Null}
POSTED BY: Neil Singer
Posted 7 years ago

The restriction to 10-deep or 100 charcters is practical, there are no options the would be affected by this restriction for sure.

I tried to make a cutoff with 100 chars in my function too, just for fun ;)

findOption5[text_String, option_String] := 
 Module[{pos, chars, sum, i}, 
  pos = StringPosition[text, option <> "{"];
  If[Length[pos] == 0, Return[False], pos = First[pos]];
  chars = 
   Characters[
    StringTake[
     text, {pos[[2]] + 1, Min[pos[[2]] + 100, StringLength[text]]}]];
  sum = 0; i = 1;
  While[sum != -1 && i <= Length[chars], 
   Which[i > 1 && chars[[i - 1]] == "\\", , chars[[i]] == "{", 
    sum = sum + 1, chars[[i]] == "}", sum = sum - 1];
   i++];
  Return[pos + {0, i}]]

The times:

In[37]:= Timing[Table[findOption[text, "\\def\\opa"], {1000}];]

Out[37]= {5.71875, Null}

In[38]:= Timing[Table[findOption2[text, "\\def\\opa"], {1000}];]

Out[38]= $Aborted

In[39]:= Timing[Table[findOption3[text, "\\def\\opa"], {1000}];]

Out[39]= {1.32813, Null}

In[46]:= Timing[Table[findOption4[text, "\\def\\opa"], {1000}];]

Out[46]= {1.26563, Null}

In[44]:= Timing[Table[findOption5[text, "\\def\\opa"], {1000}];]

Out[44]= {1.23438, Null}

Its funny, that the times of your "findOption4" and my "findOption5" are now nearly the same.

But I agree, your code is more elegant and really easier to understand.

POSTED BY: Juerg Baertsch

Juerg,

It appears that you are parsing LaTeX so I think your code breaks if the square "]" is a "}" or if the Typesetting only has an opening "{" and no closing one.

I think this code should do what you want and ignore typeset "{" or "}" marks. -- and I think it is a bit easier to understand.

findOption2[text_String, option_String] := 
 Module[{possible, opencnt, closecnt}, 
  possible = StringCases[text, option ~~ "{" ~~ x___ ~~ "}" -> x,  Overlaps -> All]; 
  opencnt = StringCount[possible, "{"] - StringCount[possible, "\{"];
  closecnt =  StringCount[possible, "}"] - StringCount[possible, "\}"];
  option <> "{" <>  possible[[Position[opencnt - closecnt, 0][[-1, 1]]]] <> "}"]

The code above avoids looping (looping is generally not "Mathematica Friendly"). I first get all possible close "}" strings for the option. Next I count the open "{" characters and remove any typeset ones "{". Next I count the close "}" characters and remove any typeset ones "}". Lastly I find the strings that have equal number of open "{" and close "}". I take the shortest string among the ones that match open/close brackets to get the option value.

It relies on a subtle behavior of MMA which is that an Overlapping StringCases seems to put the largest strings first. If this is not always true than the code would have to change a bit to find the element in the list that is 0 with the smallest corresponding value in opencnt.

Regards,

Neil

POSTED BY: Neil Singer
Posted 7 years ago

Neil,

Wow, thats a cool idea and very diffrent from mine.

It appears that you are parsing LaTeX so I think your code breaks if the square "]" is a "}" or if the Typesetting only has an opening "{" and no closing one.

I'm not sure if I understand correctly, but in this case the "{" and "}" are written as "\{" and "\}". This situation should be solved in my code with the first test in the following "Which"

   Which[
    i > 1 && chars[[i - 1]] == "\\", ,
    chars[[i]] == "{", sum = sum + 1,
    chars[[i]] == "}", sum = sum - 1
    ];

The only problem I could imagine is, when there is a commentary inside which has not to follow any syntax-rules. But I know, that in the options are no commentaries inside, so that this shouldnt be a problem for my work.

The code above avoids looping (looping is generally not "Mathematica Friendly")

Do you mean, that I should try to not use "While"? Is this for performance-reasons?

I compared our two functions with a bigger file (attached):

In[39]:= Timing[findOption[text, "\\def\\opa"];]

Out[39]= {0.015625, Null}

In[40]:= Timing[findOption2[text, "\\def\\opa"];]

Out[40]= {25.9219, Null}

I think, that in bigger files with lots of "{" and "}" the variable "possible" of your function is exploding and makes it very slow, so that this solution is not useful for my situation.

But thank you anyway, I like it very much to see different ideas.

Attachments:
POSTED BY: Juerg Baertsch
Posted 7 years ago

I've done the work now with a workaround instead of using RegularExpressions:

findOption[text_String, option_String] := Module[
  {pos, chars, sum, i},
  pos = StringPosition[text, option <> "{"];
  If[Length[pos] == 0, Return[False], pos = First[pos]];
  chars = Characters[StringTake[text, {pos[[2]] + 1, -1}]];
  sum = 0; i = 1;
  While[
   sum != -1 && i <= Length[chars],
   Which[
    i > 1 && chars[[i - 1]] == "\\", ,
    chars[[i]] == "{", sum = sum + 1,
    chars[[i]] == "}", sum = sum - 1
    ];
   i++];
  Return[pos + {0, i}]
  ]

This works now pretty good:

In[88]:= text

Out[88]= "\\input{mystandard}
    \\begin{document}
    % bla bla bla { { { } } } 
    \\def\\opx{$\\left\\{\\cfrac{x}{y}\\right]$}
    % bla bla bla { { { } } }
    \\opx
\\end{document}"

In[89]:= pos = findOption[text, "\\def\\opx"]

Out[89]= {73, 111}

In[90]:= StringTake[text, pos]

Out[90]= "\\def\\opx{$\\left\\{\\cfrac{x}{y}\\right]$}"

So I'm happy with how it works :) Thank you all for your help, and sorry again for my bad and unclear examples.

POSTED BY: Juerg Baertsch

Juerg,

I think the easiest way to do this is to go back to your original example and add Longest to the pattern and simplify it.

In[1]:= string2 = "abcde(fgh(ij)k)lmn"
justTheInsides = StringCases[string2, Longest["e(" ~~ x___ ~~ ")"] -> x][[1]]

Out[1]= "abcde(fgh(ij)k)lmn"

Out[2]= "fgh(ij)k"

Now you can do the German umlaut without any trouble -- you can even go back to your original approach of finding the position and using StringTake (in which case you do not need to name the string "x"):

pos2 = First[StringPosition[string2, Longest["e(" ~~ ___ ~~ ")"]]]
StringTake[string2, pos2 + {2, -1}]

Now on umlauts the code still works:

In[106]:= string3 = "\\def\\opc{Z\\\"{u}rich}"
StringCases[string3, Longest["opc{" ~~ x___ ~~ "}"] -> x][[1]]

Out[106]= "\\def\\opc{Z\\\"{u}rich}"

Out[107]= "Z\\\"{u}rich"

Regards,

Neil

POSTED BY: Neil Singer
Posted 7 years ago

Thank you Neil

Good idea, but if the closing bracket is not the last one, it doesen't work anymore:

In[82]:= string = "abcde(fgh(ij)k)lmn(opq)rs"
justTheInsides = 
 StringCases[string, Longest["e(" ~~ x___ ~~ ")"] -> x][[1]]

Out[82]= "abcde(fgh(ij)k)lmn(opq)rs"

Out[83]= "fgh(ij)k)lmn(opq"
POSTED BY: Juerg Baertsch
Posted 7 years ago

Hello Henrik & Gianluca

Thank you very much for your answers.

Sorry, my example was not very accurate, I tried to simplfiy it but that was not a good idee.

The string is not an expression, its one of thousends of text-files. In this files it has for example a part with the following text:

\def\opa{2018}
\def\opb{Winter}
\def\opn{a}
\def\opc{Z\"{u}rich}
\def\opd{1}

I want read out the location Z\"{u}rich and replace \def\opc{Z\"{u}rich} with \def\opLocation{Z\"{u}rich}.

The solution of Henrik doesnt work for my problem. And I think the solution of Gianluca doesn't help too, but I dont really understand it and have to think about it first.

Sorry for the bad example!

POSTED BY: Juerg Baertsch

Hi Juerg,

I want read out the location Z\"{u}rich and replace \def\opc{Z\"{u}rich} with \def\opLocation{Z\"{u}rich}.

This appears to be a different thing (but nevertheless simplifying examples is generally a good idea!) and could be done simply like so:

text = "\\def\\opa{2018}
  \\def\\opb{Winter}
  \\def\\opn{a}
  \\def\\opc{Z\\\"{u}rich}
  \\def\\opd{1}";
StringPosition[text, "\\opc{Z\\\"{u}rich}"]
StringReplace[text, "\\opc{Z\\\"{u}rich}" -> "\\opLocation{{Z\\\"{u}rich}"]

Or am I still misunderstanding your problem?

Regards -- Henrik

POSTED BY: Henrik Schachner
Posted 7 years ago

The problem is that the location in "\def\opc{location}" could be anything, "\def\opc{Z\"{u}rich}" was only an example, which shows the issue that the location could have brackets too, so its not so easy, to find out, which is the closing bracket. I think there should be an easy way with regular expressions, but I dont know much about it and I have to learn it first.

As a workaround I could lookup for "\def\opc{" only, then make characters of the following string and I can count "{" as +1 and "}" as -1 and stop if the sum is -1.

text = "\\def\\opa{2018}
    \\def\\opb{Winter}
    \\def\\opn{a}
    \\def\\opc{Z\\\"{u}rich}
    \\def\\opd{1}";

pos1 = First[StringPosition[text, "\\def\\opc{"]];
chars = Characters[StringTake[text, {pos1[[2]] + 1, -1}]];
sum = 0; i = 1;
While[
  sum != -1 && i <= Length[chars],
  Which[
   chars[[i]] == "{", sum = sum + 1,
   chars[[i]] == "}", sum = sum - 1
   ];
  i++];

location = StringTake[text, pos1 + {9, i - 2}]

(* Out location: "Z\\\"{u}rich" *)

text1 = StringReplacePart[text, 
  "\\def\\opLocation{" <> location <> "}", pos1 + {0, i - 1}]

(* Out text1: 
"\\def\\opa{2018}
    \\def\\opb{Winter}
    \\def\\opn{a}
    \\def\\opLocation{Z\\\"{u}rich}
    \\def\\opd{1}" 
*)

If I dont find a solution with regular Expressions, I try it with this workaround, but its not really nice.

POSTED BY: Juerg Baertsch

Well, as it seems the German "Umlauts" are the problem - and in particular the brackets there. How about getting rid of them as a very first step, then doing all manipulation/searching/etc. with the data, and as a last step putting them back in (if necessary):

text = "\\def\\opa{2018}
    \\def\\opb{Winter}
    \\def\\opn{a}
    \\def\\opc{Z\\\"{u}rich}
    \\def\\opd{1}";
umlautRule = {"\\\"{a}" -> "$$$a", "\\\"{o}" -> "$$$o", "\\\"{u}" -> "$$$u", "\\\"{s}" -> "$$$s"};
invUmlautRule = Reverse /@ umlautRule;
text1 = StringReplace[text, umlautRule];
(* data manipulation: *)   result0 = text1;
result = StringReplace[result0, invUmlautRule];

The programming should be nicer then.

POSTED BY: Henrik Schachner
Posted 7 years ago

In the example its a "German Umlaut", but there are other options that could contain any LaTeX-Code, as an example:

"\\def\\opx{\frac{x}{e^{x-1}}}"

So its not possible to change the "inner brackets" temporarly.

POSTED BY: Juerg Baertsch

Here is my hack:

myString = "abcde(fgh(ij)k)lmn";
subsequences = StringSplit[myString, "(" | ")"];
replacements = 
  Thread[subsequences -> Map[ToString, Range[Length[subseq]]]];
inverseReplacements = Thread[Range[Length[subseq]] -> subsequences];
StringJoin["{", 
  StringSplit[myString, x : "(" | ")" :> x] /. 
    replacements /. {"(" -> ",{", ")" -> "},"}, "}"];
% // ToExpression;
% /. inverseReplacements
Cases[%, _List, {1, Infinity}];
Map[StringJoin[Flatten[#]] &, %]

If I were sure that the string does not contain curly braces it would be half as long. The idea is to convert the string to a WL nested list expression delimited by curly braces. Then it is easy to extract sublists with Cases.

POSTED BY: Gianluca Gorni

Hi Juerg,

I am sure there is a much more elegant solution, but maybe this is in some way helpful:

string2 = "abcde(fgh(ij)k)lmn";
expr = MakeExpression[RowBox[Characters[string2]], StandardForm];
Level[expr, {Depth[expr] - 1}]
(*  Out:  {i,j}   *)
Level[expr, {Depth[expr] - 2}]
(*  Out:  {f,g,h,i j,k}  *)

Regards -- Henrik

POSTED BY: Henrik Schachner
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract