removing lines with StringReplace

Posted 12 years ago
I want to remove lines of text that match a pattern. I'm not sure why the first line is not matched by the replacement rule here:
StringReplace["x \na", {"x" ~~ (Except[EndOfLine] ...) ~~ EndOfLine -> ""}]
If I have this shorter version:
StringReplace["x \na", {"x" ~~ (Except[EndOfLine] ...) -> ""}]
then it matches and blanks out the x, but leaves the line there. When I add ~~ EndOfLine to the pattern, suddenly the x line stops being matched. Eventually I think I want
StringReplace["x \na", {"x" ~~ (Except[EndOfLine] ...) ~~ EndOfLine ~~ "\n" -> ""}]
so that the newline character itself is matched and removed.
POSTED BY: Joel Klein
3 Replies
First, I must state that I know very little about regular expressions, so take all of this with a grain of salt.

My guess was that string patterns either translate to regular expressions, or very likely call a regex library behind the scenes.  So, it looks like we need to do some more spelunking today.  Trying to find out if the former is the case, I found StringPattern`PatternConvert, which seems to convert string patterns to regexes.  Whether this is really how string patterns work, or this function is used for something else, I do not know ...  But let's work under the assumption that all string patterns get translated to regexes first.  (I do have to note though that StringPattern`PatternConvert is written in Mathematica, and I can read its code, while Information[] reveals no readable code for StringMatchQ.  Also, Trace wouldn't reveal any calls to PatternConvert when running StringMatchQ.)

Some observations:
  • In regular expressions, EndOfLine corresponds to $.  It indicates not a character, but a position in the string.  $ does not actually consume a character when matching.  It wouldn't make much sense to use $ the way you used EndOfLine inside Except.
  • In string patterns, Except is only allowed to contain a single character, but not a full word or a full pattern.  Except["word"] issues an error.  However, it is allowed to contain a list of characters too, as in Except[{"a", "b", "c"}].
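
To make the second observation concrete, here is a small sketch of Except used as a single-character class. The input strings are made up for illustration, and the expected result for the list form is only inferred from the description above, not something I have verified.

```wolfram
(* Except with a single character matches any ONE character other than it *)
StringReplace["a-b-c", Except["-"] -> "_"] // InputForm
(* expected: "_-_-_" *)

(* A list of characters acts like a negated character class. *)
(* Assumption: a, b and c are left alone and only "d" is replaced. *)
StringReplace["abcd", Except[{"a", "b", "c"}] -> "_"] // InputForm
```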

Now let's see what Except translates to:  StringPattern`PatternConvert[Except["a"]]  -->  {"(?ms)[^a]", {}, {}, Hold}  So it translates to a negated character class. The "(?ms)" prefix sets regex flags: m lets ^ and $ match at line boundaries inside the string, and s lets . match newline characters too.

With more than one character, we get what one would expect, a character class with several characters: StringPattern`PatternConvert[Except[{"a", "b", "c"}]] -->  {"(?ms)[^abc]", {}, {}, Hold}

$ can't be used as an end-of-line indicator in character classes, so let's see what Except[EndOfLine] translates to:  StringPattern`PatternConvert[Except[EndOfLine]] --> {"(?ms)(?!$)", {}, {}, Hold}

I had to look (?! ... ) up: http://www.regular-expressions.info/refadv.html  It says, "Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match."

The first pattern you mention, "x" ~~ (Except[EndOfLine] ...) ~~ EndOfLine, translates to
(?ms)x(?:(?!$))*$
(This site desperately needs inline code spans to get rid of smileys ... )  The (?: ... ) part is merely for grouping.  The problem with this regex is that it does not contain any "." patterns, which could consume characters.  Neither $ nor the lookahead (?!$) consumes characters.  So it can't consume the " " (space) to be able to get to the end of the line.  If we modify it like this,
"(?ms)x(?:(?!$).)*$"
and use it inside RegularExpression, then it will work.
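
Here is a minimal sketch of that, using the string from the question. The second call is my own variation, not something from the original question: it appends an optional newline after $ so that the line break is removed along with the line.

```wolfram
(* The repaired regex, wrapped in RegularExpression *)
StringReplace["x \na", RegularExpression["(?ms)x(?:(?!$).)*$"] -> ""] // InputForm
(* expected: "\na" -- the x line is emptied, but its newline stays *)

(* Assumption: an optional newline after $ also removes the line break *)
StringReplace["x \na", RegularExpression["(?ms)x(?:(?!$).)*$\n?"] -> ""] // InputForm
(* expected: "a" *)
```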

The answer to your second question, why "x" ~~ (Except[EndOfLine] ...) will leave the " " (the space) in the string, is again that there's no "." which could consume any characters.  There's no end of line right after the "x", so it does match, but it is unable to consume the rest of the line.

Finally, the pattern you could use is this:
StringReplace["x ", {"x" ~~ ___ ~~ ("\n" | EndOfLine) -> ""}]
It is not necessary to use Except because if there is an "\n" in the string, then there is certainly an EndOfLine.  We still need to write the end as ("\n" | EndOfLine) because the string has an EndOfLine at the very end too, even though it may not have a newline there.
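
One caveat, as far as I understand how ___ behaves: it is greedy and also matches newlines, so on a multi-line string the pattern above can run on to the very end of the string instead of stopping at the end of the first line. Wrapping the pattern in Shortest should keep it to one line at a time. A sketch with a made-up four-line input:

```wolfram
text = "x \na \nx \nc";

(* Greedy ___ : the match may extend from the first x to the end of the string *)
StringReplace[text, "x" ~~ ___ ~~ ("\n" | EndOfLine) -> ""] // InputForm

(* Shortest stops at the first newline or end of line after each x *)
StringReplace[text, Shortest["x" ~~ ___ ~~ ("\n" | EndOfLine)] -> ""] // InputForm
(* expected: "a \nc" *)
```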

I hope this helps and I hope it's clear enough.  The editor on this site is much more difficult to use than on SE and it's a bit buggy ... 
POSTED BY: Szabolcs Horvát
All this said, I would approach the problem a bit differently:  first StringSplit[..., "\n"], then just use Cases/DeleteCases/Select to get the lines I want.  And maybe StringJoin[] it back together if necessary (Riffle[]ing in the "\n" separators).
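
A rough sketch of that approach, assuming the lines to drop are exactly those that start with "x" (the variable names and the sample text are just for illustration):

```wolfram
text = "x \na \nx \nc";

lines = StringSplit[text, "\n"];                        (* {"x ", "a ", "x ", "c"} *)
kept = Select[lines, ! StringMatchQ[#, "x" ~~ ___] &];  (* {"a ", "c"} *)

(* Put the newline separators back between the remaining lines *)
StringJoin[Riffle[kept, "\n"]] // InputForm
(* expected: "a \nc" *)
```

Since each line is tested on its own, there is no need to think about EndOfLine or about how greedy the pattern is.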
POSTED BY: Szabolcs Horvát
Do you need to use Except with EndOfLine? If not, does this work for you?
In[1]:= StringReplace["x \na \nx \nc", RegularExpression["x.*\n"] -> ""] // InputForm
Out[1]//InputForm= "a \nc"
In[2]:= StringReplace["x \na \nx \nc", {Shortest["x" ~~ ___ ~~ EndOfLine ~~ "\n"] -> ""}] // InputForm
Out[2]//InputForm= "a \nc"
POSTED BY: Zach Bjornson