WOLFRAM COMMUNITY

5.6K Views

3 Replies

3 Total Likes

GROUPS:

Mathematica Wolfram Language

removing lines with StringReplace

Joel Klein, WOLFRAM

Posted 13 years ago

I want to remove lines of text that match a pattern. I'm not sure why the first line is not matched by the replacement rule here:

    StringReplace["x \na", {"x" ~~ (Except[EndOfLine] ...) ~~ EndOfLine -> ""}]

If I have this shorter version:

    StringReplace["x \na", {"x" ~~ (Except[EndOfLine] ...) -> ""}]

then it matches and blanks out the x, but leaves the line there. When I add ~~ EndOfLine to the pattern, suddenly the x line stops being matched.

Eventually I think I want

    StringReplace["x \na", {"x" ~~ (Except[EndOfLine] ...) ~~ EndOfLine ~~ "\n" -> ""}]

so that the newline character itself is matched and removed.

Zach Bjornson, Wolfram Research

Posted 13 years ago

Do you need to use Except with EndOfLine? If not, does this work for you?

    In[1]:= StringReplace["x \na \nx \nc", RegularExpression["x.*\n"] -> ""] // InputForm
    Out[1]//InputForm= "a \nc"

    In[2]:= StringReplace["x \na \nx \nc", {Shortest["x" ~~ ___ ~~ EndOfLine ~~ "\n"] -> ""}] // InputForm
    Out[2]//InputForm= "a \nc"

Szabolcs Horvát

Posted 13 years ago

All this said, I would approach the problem a bit differently: first StringSplit[..., "\n"], then just use Cases/DeleteCases/Select to get the lines I want. And maybe StringJoin[] it back together if necessary (Riffle[]ing in the "\n" separators).
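The split-and-filter approach described above can be sketched roughly as follows. This is only a minimal illustration: the sample string and the "starts with x" test are assumptions borrowed from the other examples in this thread, and the variable names are made up for the sketch.

```wolfram
(* Sample input; filtering on a leading "x" is only an illustration *)
text = "x \na \nx \nc";

(* Split into lines; the "\n" separators themselves are dropped *)
lines = StringSplit[text, "\n"];

(* Keep only the lines that do not start with "x" *)
kept = Select[lines, ! StringMatchQ[#, "x" ~~ ___] &];

(* Reassemble, Riffle-ing the "\n" separators back in *)
result = StringJoin[Riffle[kept, "\n"]]
```

For this input, result should be "a \nc", the same string as in the StringReplace examples. One thing to keep in mind is that StringSplit by default drops empty strings at the start and end, so a trailing newline in the input would not survive the round trip.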

Szabolcs Horvát

Posted 13 years ago

First, I must state that I know very little about regular expressions, so take all of this with a grain of salt.

My guess was that string patterns either translate to regular expressions or, very likely, call a regex library behind the scenes. So it looks like we need to do some more spelunking today. Trying to find out whether the former is the case, I found StringPattern`PatternConvert, which seems to convert string patterns to regexes. Whether this is really how string patterns work, or whether this function is used for something else, I do not know. But let's work under the assumption that all string patterns get translated to regexes first. (I do have to note that StringPattern`PatternConvert is written in Mathematica and I can read its code, while Information[] reveals no readable code for StringMatchQ. Also, Trace does not reveal any calls to PatternConvert when running StringMatchQ.)

Some observations:

In regular expressions, EndOfLine corresponds to $. It indicates not a character but a position in the string, so $ does not actually consume a character when matching. It doesn't make much sense to use $ the way you used EndOfLine inside Except.

In string patterns, Except is only allowed to contain a single character, not a full word or a full pattern: Except["word"] issues an error. However, it is also allowed to contain a list of characters, as in Except[{"a", "b", "c"}].

Now let's see what Except translates to:

    StringPattern`PatternConvert[Except["a"]]
    --> {"(?ms)[^a]", {}, {}, Hold}

So it translates to a character class. The "(?ms)" part is documented; it indicates that newlines are allowed in the string.

With more than one character, we get what one would expect, a character class with several characters:

    StringPattern`PatternConvert[Except[{"a", "b", "c"}]]
    --> {"(?ms)[^abc]", {}, {}, Hold}

$ can't be used as an end-of-line indicator inside a character class, so let's see what Except[EndOfLine] translates to:

    StringPattern`PatternConvert[Except[EndOfLine]]
    --> {"(?ms)(?!$)", {}, {}, Hold}

I had to look (?! ... ) up: http://www.regular-expressions.info/refadv.html

It says, "Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match."

The first pattern you mention, "x" ~~ (Except[EndOfLine] ...) ~~ EndOfLine, translates to

    (?ms)x(?:(?!$))*$

(This site desperately needs inline code spans to get rid of smileys ... )

The (?: ... ) part is merely for grouping. The problem with this regex is that it does not contain any "." patterns, which could consume characters. $ does not consume characters, and neither does the lookahead. So the regex can't consume the " " (space) to get to the end of the line. If we modify it like this,

    "(?ms)x(?:(?!$).)*$"

and use it inside RegularExpression, then it will work.

The answer to your second question, why "x" ~~ (Except[EndOfLine] ...) leaves the " " (the space) in the string, is again that there's no "." which could consume any characters. There's no end of line directly after the "x", so the pattern does match, but it is unable to consume the rest of the line.

Finally, the pattern you could use is this:

    StringReplace["x ", {"x" ~~ ___ ~~ ("\n" | EndOfLine) -> ""}]

It is not necessary to use Except, because if there is an "\n" in the string, then there is certainly an EndOfLine. We still need to write the end as ("\n" | EndOfLine) because the string has an EndOfLine at the very end too, even though it may not have a newline there.

I hope this helps and I hope it's clear enough. The editor on this site is much more difficult to use than on SE and it's a bit buggy ...
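To make the "nothing consumes the space" argument concrete, here is a small sketch that feeds both regexes straight to RegularExpression. The * after the group is an assumption on my part, added to mirror the ... (RepeatedNull) in the original string pattern; the variable names are made up for the illustration.

```wolfram
(* As translated: the group holds only a zero-width lookahead,
   so nothing can consume the space between "x" and the line end *)
noConsume = RegularExpression["(?ms)x(?:(?!$))*$"];

(* Modified: the "." inside the group consumes one character per step *)
consume = RegularExpression["(?ms)x(?:(?!$).)*$"];

unchanged = StringReplace["x \na", noConsume -> ""];  (* no match expected *)
blanked = StringReplace["x \na", consume -> ""];      (* "x " removed, "\n" kept *)

InputForm /@ {unchanged, blanked}

(* String-pattern variant that is meant to remove the newline as well;
   Shortest is added so that ___ does not run on past the first line break *)
StringReplace["x \na", Shortest["x" ~~ ___ ~~ ("\n" | EndOfLine)] -> ""]
```

With standard Perl-style regex semantics, unchanged should still be "x \na" and blanked should be "\na". The last line follows the Shortest idea from Zach Bjornson's reply: ___ is greedy and can span newlines, so on multi-line input the unwrapped pattern would probably swallow more than one line.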
