StringSplit to split at word boundaries but excluding single quote

To split a string at word boundaries with StringSplit, I use the following code:

ParseText[ text_] :=
 StringSplit[ToLowerCase[text], RegularExpression["\\W+"]]

Applying this function gives a problematic result. For example,

sample1 = "These are the times that try men's souls.";
ParseText[sample1]

returns

{"these", "are", "the", "times", "that", "try", "men", "s", "souls"}

The problem is that words like "men's", "can't", "don't", etc. are split into two words. What regular expression can I use to keep words like these intact?

POSTED BY: Lawrence Winkler
3 Replies

I am not that good with regular expressions, so here is a possibility using a StringExpression:

In[1]:= ParseText[text_] :=  StringSplit[ToLowerCase[text], Except[LetterCharacter | "'"]]

In[2]:= sample1 = "These are the times that try men's souls.";

In[3]:= ParseText[sample1]

Out[3]= {"these", "are", "the", "times", "that", "try", "men's", "souls"}
POSTED BY: David Reiss

I used your suggestion, with a simple modification. So now we have the original and the enhanced version.

(* text is a string; the result is the list of lowercase words within text *)
ParseText[text_] :=
  StringSplit[ToLowerCase[text], RegularExpression["\\W+"]]

(* text is a string; the result is the list of lowercase words within text *)
ParseText2[text_] :=
  StringSplit[ToLowerCase[text], Except[WordCharacter | "'"]]

One would think the enhanced version solves all problems. Alas, no such luck.

In[227]:= sample3 = "These are the times, that try men's souls.";

In[228]:= ParseText[sample3]

Out[228]= {"these", "are", "the", "times", "that", "try", "men", "s",  "souls"}

In[229]:= ParseText2[sample3]

Out[229]= {"these", "are", "the", "times", "", "that", "try", "men's", "souls"}

Merely adding a comma to the sample phrase results in an empty string showing up as a separate word where the comma meets the following space. According to the documentation, RegularExpression["\\W+"] is equivalent to Except[WordCharacter], so adding "\[RawQuote]" to the Except pattern should not cause problems. It does. If this result is not a bug in Mathematica, I don't understand its logic.

POSTED BY: Lawrence Winkler
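
A note on the empty string above: it is probably not a bug. The regular expression "\\W+" carries a + quantifier, so a run of non-word characters such as ", " is consumed as one delimiter. Except[WordCharacter | "'"] has no such repetition and matches one character at a time, so StringSplit sees two adjacent delimiters and reports the empty string between them. Below is a minimal sketch of two possible fixes, assuming the goal is only to keep apostrophes inside words. The names ParseText3 and ParseText4 are hypothetical and do not appear in the posts above.

```wolfram
(* ParseText3 and ParseText4 are hypothetical names used only for this sketch *)

(* String-pattern form: the trailing .. (Repeated) plays the role of + in "\\W+" *)
ParseText3[text_String] :=
  StringSplit[ToLowerCase[text], Except[WordCharacter | "'"] ..]

(* Regular-expression form: a negated class that keeps the apostrophe out of the delimiters *)
ParseText4[text_String] :=
  StringSplit[ToLowerCase[text], RegularExpression["[^\\w']+"]]

sample3 = "These are the times, that try men's souls.";
ParseText3[sample3]
ParseText4[sample3]
```

Both calls should return {"these", "are", "the", "times", "that", "try", "men's", "souls"} for sample3. One caveat: in the regular-expression form \w also counts the underscore as a word character, so the two versions may differ on text containing underscores.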

Lawrence:

Why not try the opposite: split at whitespace and/or an explicit list of other characters to exclude? We would use:

In[1]:= parseText[text_] :=
  (* the trailing + treats a run of separators, such as ", ", as one delimiter *)
  StringSplit[ToLowerCase[text], RegularExpression["(?:\\s|[.,:;])+"]]

In[2]:= sample1 = "These are the times that try men's souls.";

In[3]:= parseText[sample1]

Out[3]= {"these", "are", "the", "times", "that", "try", "men's", "souls"}

Keep adding characters to the bracketed class after the "|" in the regular expression as you find others that need to be excluded. Note that single quotes that do not mark contractions, as in "say 'hello world'", will remain attached to the words, so those would need separate handling. http://regexlib.com/ is a good place to start.

POSTED BY: Hans Michel
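
For the quoted-phrase case mentioned above, one possible approach is to extract the words rather than split at delimiters, accepting an apostrophe only when it sits between letters or digits. This is a rough sketch under stated assumptions: the name extractWords is hypothetical, and the pattern handles ASCII letters and digits only.

```wolfram
(* extractWords is a hypothetical helper name used only for this sketch *)
(* A word is a run of a-z/0-9, optionally followed by 'run groups, so an *)
(* apostrophe is kept only when it has word characters on both sides     *)
extractWords[text_String] :=
  StringCases[ToLowerCase[text],
   RegularExpression["[a-z0-9]+(?:'[a-z0-9]+)*"]]

extractWords["Say 'hello world', but don't split men's."]
```

This should return {"say", "hello", "world", "but", "don't", "split", "men's"}. The trade-off is that a trailing apostrophe, as in the plural possessive "souls'", is dropped along with the quotation marks.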