StringSplit to split at word boundaries but excluding single quote

To split a string at word boundaries with StringSplit, I use this code:

ParseText[text_] :=
  StringSplit[ToLowerCase[text], RegularExpression["\\W+"]]

Applying this function gives a problematic result. For example,

sample1 = "These are the times that try men's souls.";
ParseText[sample1]

returns

{"these", "are", "the", "times", "that", "try", "men", "s", "souls"}

This splits words like "men's", "can't", "don't", etc. into two words. What regular expression can I use to keep words like these intact?

POSTED BY: Lawrence Winkler
3 Replies
POSTED BY: Hans Michel
POSTED BY: David Reiss

I used your suggestion, with a simple modification. So now we have the original and the enhanced version.

(* text is a string; the result is a lowercase list of the words within text *)
ParseText[text_] :=
  StringSplit[ToLowerCase[text], RegularExpression["\\W+"]]

(* text is a string; the result is a lowercase list of the words within text *)
ParseText2[text_] :=
  StringSplit[ToLowerCase[text], Except[WordCharacter | "'"]]

One would think the enhanced solves all problems. Alas, no such luck.

In[227]:= sample3 = "These are the times, that try men's souls.";

In[228]:= ParseText[sample3]

Out[228]= {"these", "are", "the", "times", "that", "try", "men", "s",  "souls"}

In[229]:= ParseText2[sample3]

Out[229]= {"these", "are", "the", "times", "", "that", "try", "men's", "souls"}

Merely adding a comma to the sample phrase puts an empty string in the result, as if the space after the comma were recognized as a separate word. As I read the documentation, RegularExpression["\\W+"] is equivalent to Except[WordCharacter], so adding the pattern "\[RawQuote]" to the Except pattern should not cause problems. It does. If this result is not a bug in Mathematica, I don't understand its logic.

POSTED BY: Lawrence Winkler
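
A possible explanation, offered as a sketch and not as an official answer: the regular expression "\\W+" carries a + quantifier, so a run such as ", " counts as one delimiter, whereas Except[WordCharacter | "'"] matches exactly one character. With the single-character pattern, the comma and the following space are two adjacent delimiters, and StringSplit returns the empty string between them. Adding a repetition to the string pattern should restore the behavior of the original. The names ParseText3 and ParseText3RE below are hypothetical and used only for illustration.

```wolfram
(* Hypothetical name: ParseText3.
   Split on runs (..) of characters that are neither word characters
   nor a single quote, so ", " is treated as one delimiter. *)
ParseText3[text_] :=
  StringSplit[ToLowerCase[text], Except[WordCharacter | "'"] ..]

(* Roughly the same idea as a regular expression, keeping the + quantifier.
   Note that \w also admits the underscore, so the two are not identical. *)
ParseText3RE[text_] :=
  StringSplit[ToLowerCase[text], RegularExpression["[^\\w']+"]]

sample3 = "These are the times, that try men's souls.";

ParseText3[sample3]
(* expected: {"these", "are", "the", "times", "that", "try", "men's", "souls"} *)

ParseText3RE[sample3]
(* expected: the same list for this sample *)
```

One trade-off of this approach is that any single quote counts as part of a word, so apostrophes used as quotation marks around a word stay attached to it.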