StringSplit to split at word boundaries but excluding single quote

Lawrence Winkler, UW Madison (retired)

Posted 12 years ago

To split a string at word boundaries, I use the code

ParseText[text_] := StringSplit[ToLowerCase[text], RegularExpression["\\W+"]]

Applying this function gives a problematic result. For example,

sample1 = "These are the times that try men's souls.";

ParseText[sample1]

returns

{"these", "are", "the", "times", "that", "try", "men", "s", "souls"}

That is, words like "men's", "can't", "don't", etc. are split into two words. What regular expression can I use to keep words like these intact?

POSTED BY: Lawrence Winkler

Hans Michel, Michel Information Services

Posted 12 years ago

Lawrence:

Why not try the opposite: split at white space and/or a list of other atomic characters to exclude. So we would use:

In[1]:= parseText[text_] := StringSplit[ToLowerCase[text], {RegularExpression["\\s|[\\.,\\,,\\:,\\;]"]}];

In[2]:= sample1 = "These are the times that try men's souls.";

In[3]:= parseText[sample1]

Out[3]= {"these", "are", "the", "times", "that", "try", "men's", "souls"}

Keep adding after the "|" in the regular expression as you need to exclude other characters. Note that if you need to exclude single quotes when they do not represent contractions, as in "say 'hello world'", this approach leaves the single quote as part of the words. http://regexlib.com/ is a good place to start.

POSTED BY: Hans Michel
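
The caveat about quotes that are not contractions can be handled after splitting. The following is a minimal sketch, not code from the thread: the name parseTextTrimmed is hypothetical, and it assumes that the delimiters can be collapsed into one character class with a + quantifier, so that a run such as ", " counts as a single delimiter. It then strips apostrophes from the edges of each token with StringTrim.

```wolfram
(* Hypothetical helper (not from the thread). Assumption: whitespace and . , : ;
   are the only delimiters; extend the character class as needed. *)
parseTextTrimmed[text_String] :=
  DeleteCases[
    (* Strip leading and trailing apostrophes from every token *)
    StringTrim[#, "'" ..] & /@
      StringSplit[ToLowerCase[text], RegularExpression["[\\s.,:;]+"]],
    ""  (* drop any token that consisted only of apostrophes *)
  ]

parseTextTrimmed["Say 'hello world' to men's souls."]
(* {"say", "hello", "world", "to", "men's", "souls"} *)
```

The trade-off is that a plural possessive such as "souls'" also loses its trailing apostrophe.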

Lawrence Winkler, UW Madison (retired)

Posted 12 years ago

I used your suggestion, with a simple modification. So now we have the original and the enhanced.

ParseText[text_] := (* text is string, result is lower case list of words within text *)
  StringSplit[ToLowerCase[text], RegularExpression["\\W+"]]

ParseText2[text_] := (* text is string, result is lower case list of words within text *)
  StringSplit[ToLowerCase[text], Except[WordCharacter | "'"]]

One would think the enhanced solves all problems. Alas, no such luck.

In[227]:= sample3 = "These are the times, that try men's souls.";

In[228]:= ParseText[sample3]

Out[228]= {"these", "are", "the", "times", "that", "try", "men", "s",  "souls"}

In[229]:= ParseText2[sample3]

Out[229]= {"these", "are", "the", "times", "", "that", "try", "men's", "souls"}

Merely adding a comma in the sample phrase results in the space after the comma being recognized as a separate word. According to the documentation, RegularExpression["\\W+"] is equivalent to Except[WordCharacter], so adding the pattern "\[RawQuote]" to the Except pattern should not cause problems. It does. If this result is not a bug in Mathematica, I don't understand its logic.

POSTED BY: Lawrence Winkler
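
A likely explanation for the empty string, offered as a sketch rather than an authoritative answer: the regular expression "\\W+" matches a run of one or more non-word characters, whereas Except[WordCharacter | "'"] matches exactly one character. In "times, that" the comma and the space are therefore two adjacent delimiters, and StringSplit returns the empty string that lies between them. Appending .. (Repeated) to the pattern restores the run-matching behavior. The names parseText3 and parseText3Regex below are hypothetical and do not appear in the thread.

```wolfram
(* Hypothetical variants (not from the thread). *)

(* String-pattern form: the Repeated operator turns a run of delimiter
   characters into a single delimiter *)
parseText3[text_String] :=
  StringSplit[ToLowerCase[text], Except[WordCharacter | "'"] ..]

(* Regular-expression form: one or more characters that are neither
   word characters nor apostrophes *)
parseText3Regex[text_String] :=
  StringSplit[ToLowerCase[text], RegularExpression["[^\\w']+"]]

sample3 = "These are the times, that try men's souls.";

parseText3[sample3]
(* {"these", "are", "the", "times", "that", "try", "men's", "souls"} *)

parseText3Regex[sample3]
(* {"these", "are", "the", "times", "that", "try", "men's", "souls"} *)
```

The two forms agree on this sample, but they may differ on characters such as the underscore, which the regular-expression word class includes.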

David Reiss, Scientific Arts

Posted 12 years ago

I am not that good with Regular Expressions so here is a possibility using a StringExpression:

In[1]:= ParseText[text_] :=  StringSplit[ToLowerCase[text], Except[LetterCharacter | "'"]]

In[2]:= sample1 = "These are the times that try men's souls.";

In[3]:= ParseText[sample1]

Out[3]= {"these", "are", "the", "times", "that", "try", "men's", "souls"}

POSTED BY: David Reiss
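
This pattern uses LetterCharacter where the earlier posts use WordCharacter, so digits become delimiters as well. The sketch below contrasts the two choices; the names parseLetters and parseWords are hypothetical, and .. is appended to each pattern on the assumption that a run of delimiter characters should count as one delimiter.

```wolfram
(* Hypothetical helpers (not from the thread). *)

(* Letters and apostrophes form tokens; digits act as delimiters *)
parseLetters[text_String] :=
  StringSplit[ToLowerCase[text], Except[LetterCharacter | "'"] ..]

(* Letters, digits, and apostrophes form tokens *)
parseWords[text_String] :=
  StringSplit[ToLowerCase[text], Except[WordCharacter | "'"] ..]

parseLetters["Apollo 11 didn't fail."]
(* {"apollo", "didn't", "fail"} *)

parseWords["Apollo 11 didn't fail."]
(* {"apollo", "11", "didn't", "fail"} *)
```

Which one fits depends on whether numbers should survive as tokens in the parsed text.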
