Finding a substring using StringCases?

Posted 7 years ago

Hi everyone,

The string text in the attached file contains some XML-like text. I want to extract the chunk that starts with <FILER>, ends with </FILER>, and contains the substring id. Three chunks begin with <FILER> and end with </FILER>, but only one of them contains id. My code is not quite right: it grabs the chunk containing id together with the one before it.

StringCases[text, Shortest["<FILER>" ~~ __ ~~ id ~~ __ ~~ "</FILER>"], Overlaps -> False]

Any tips would be much appreciated.

Greg

POSTED BY: Gregory Lypny
4 Replies
Posted 7 years ago

Hi Sander,

I appreciate the heads up about Shortest's limitations. I am scraping info from a sample of about 154,000 government documents and a lot of the code makes use of Shortest. I'm going to rethink some of it because I can't afford missed observations or the wrong data being extracted. I'm also going to do a speed comparison of variations of StringSplit with StringCases.

Thanks once again,

Gregory

POSTED BY: Gregory Lypny
Posted 7 years ago

Hi Sander,

Thanks for replying. That explains why I have always had problems with Shortest[…] as applied to strings. The Overlaps option also gives me results that I do not expect.

I'll try your fix. I never would have thought of straddling the pattern with ___.

Another approach that works is to first split the main string using <f> and </f> as delimiters. Looks like I will be changing a lot of my code.

Thanks again,

Gregory

POSTED BY: Gregory Lypny

Hi Gregory,

I generally prefer the split-and-check method. I've used constructs like these quite often:

str = "<f>test</f><f>WWW</f><f>test</f><f>test</f><f>test</f><f>test</f><f>WWW</f><f>test</f>"
Select[StringCases[str, Shortest["<f>" ~~ ___ ~~ "</f>"]], StringContainsQ["WWW"]]

Constructs using StringSplit also work nicely.
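
A StringSplit variant might look like the sketch below. It is only an illustration on a toy string like the one above, and it assumes the chunks are not nested. Splitting discards the delimiters, so the tags are put back at the end in case the full chunk is wanted.

```wolfram
str = "<f>test</f><f>WWW</f><f>test</f><f>test</f><f>WWW</f><f>test</f>";

(* Split on either tag; StringSplit drops the delimiters themselves *)
chunks = StringSplit[str, "<f>" | "</f>"];

(* Keep only the pieces that contain the substring of interest *)
hits = Select[chunks, StringContainsQ["WWW"]]
(* {"WWW", "WWW"} *)

(* Re-attach the tags if the complete chunk is needed *)
Map["<f>" <> # <> "</f>" &, hits]
(* {"<f>WWW</f>", "<f>WWW</f>"} *)
```

Because each piece is tested separately, a match can never run across two chunks, which is the failure seen with Shortest.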

Unfortunately, I can't explain exactly why it does not work; I think the reason is very subtle. Just remember that Shortest and strings don't work together the way you (I guess) expect.

POSTED BY: Sander Huisman

Hi Gregory,

This is a common pitfall of Shortest: when used with strings, it does not necessarily return the shortest possible match.

StringCases[text, ___ ~~ y : Shortest["<FILER>" ~~ ___ ~~ id ~~ ___ ~~ "</FILER>"] ~~ ___ :> y]

That does work. Sometimes one needs to specify an Except[...] inside as well.

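The problem is that strings are matched from left to right, so a match begins at the first position from which the whole pattern can succeed, and Shortest only shortens it from there. As a result: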

StringCases["<f>test</f><f>WWW</f><f>test</f>", Shortest["<f>" ~~ ___ ~~ "WWW" ~~ ___ ~~ "</f>"]]

will return

"<f>test</f><f>WWW</f>"

not

"<f>WWW</f>"

If you really want to know why, you will have to read the PCRE documentation: as far as I know, Wolfram string patterns are converted to PCRE regular expressions, and that library is then used for the matching.
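
The same behaviour can be reproduced with an explicit regular expression, which may make the cause easier to see. This is only a sketch on the toy string: the lazy quantifier .*? plays the role of Shortest, and the negated character class [^<] stands in for the Except[...] idea mentioned above. The character-class version assumes a chunk contains no "<" between its tags, so it would need adapting if the real <FILER> chunks contain nested tags.

```wolfram
str = "<f>test</f><f>WWW</f><f>test</f>";

(* Lazy quantifiers only trim the match on the right. The engine still
   starts at the first "<f>" from which the whole pattern can succeed. *)
StringCases[str, RegularExpression["<f>.*?WWW.*?</f>"]]
(* {"<f>test</f><f>WWW</f>"} *)

(* Forbidding "<" between the tags keeps a match inside a single chunk *)
StringCases[str, RegularExpression["<f>[^<]*WWW[^<]*</f>"]]
(* {"<f>WWW</f>"} *)
```

In the first call the attempt that starts at the first "<f>" can succeed by letting .*? grow past "</f><f>", so the engine never tries the later, shorter candidate.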

POSTED BY: Sander Huisman