How Can I Extract Capitalised Words and Phrases From a Text?

Posted 10 years ago

I have data with plenty of text strings where key words and phrases are capitalised. I want to extract the capitalised bits, but without breaking capitalised phrases down into their constituent parts. Some pseudocode explains it best:

text = "Here is some text where KEY words and EXTENDED PHRASES are in UPPER CASE.";
< SOME ELEGANT CODE  EXTRACTS...>
 {"KEY", "EXTENDED PHRASES", "UPPER CASE"}

...but does not break the extended phrases into their constituent parts. So the result is NOT...

{"KEY", "EXTENDED", "PHRASES", "EXTENDED PHRASES", "UPPER", "CASE", "UPPER CASE"}

N.B. - The extended capitalised phrases can be of any length.

I can think of plenty of clumsy ways of doing this (in fact, these days I am getting pretty good at clumsy). Has anyone got an elegant way?

Thanks in advance

Brad

POSTED BY: Brad Varey
3 Replies

This seems to work in your example:

   StringCases["Here is some text where KEY words and EXTENDED PHRASES  are in UPPER CASE.", 
   RegularExpression["[A-Z][ A-Z]+[A-Z]"]]
POSTED BY: Gianluca Gorni
Posted 10 years ago

Thanks, Gianluca,

Although I think I can fix it myself, your solution does choke a bit on an input like...

"DONE Here Is some text where KEY words and EXTENDED PHRASES are in UPPER CASE."

It does properly catch DONE, but it then also picks up just the "H" of "Here" and attaches it to the match. My returned value was

{"DONE H", "KEY", "EXTENDED PHRASES", "UPPER CASE"}

Since posting, it has also occurred to me that I will need to include words which are all upper case but which contain certain punctuation or digits. So, the following should also be considered upper case:

DON'T NON-EXISTENT R2D2

and so on. But I think I can fix it myself, using your example as a springboard.
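
For reference, here is a minimal sketch of one possible fix along those lines. It is not a tested production solution. It assumes that a "word" starts with A-Z and may continue with A-Z, digits, apostrophes and hyphens, and that words in a phrase are separated by plain spaces. The name extractCaps is hypothetical. The lookbehind and lookahead reject a match that touches a lower-case letter, which is what stops the "H" of "Here" from being picked up.

```wolfram
(* Hypothetical helper, not part of the original answer.
   Assumption: an upper-case word is [A-Z] followed by any of A-Z, 0-9, ' or -.
   (?<!...) and (?!...) require that the match is not glued to another
   letter, digit, apostrophe or hyphen on either side. *)
capsRegex = RegularExpression[
   "(?<![A-Za-z0-9'-])[A-Z][A-Z0-9'-]*(?: +[A-Z][A-Z0-9'-]*)*(?![A-Za-z0-9'-])"];

extractCaps[text_String] := StringCases[text, capsRegex]

extractCaps["DONE Here Is some text where KEY words and EXTENDED PHRASES are in UPPER CASE."]
(* {"DONE", "KEY", "EXTENDED PHRASES", "UPPER CASE"} *)

extractCaps["He said DON'T, it is NON-EXISTENT like R2D2"]
(* {"DON'T", "NON-EXISTENT", "R2D2"} *)
```

One caveat of this sketch: a stand-alone single capital such as "I" or "A" also counts as an upper-case word, so it may need filtering afterwards if that is not wanted.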

Thanks!

POSTED BY: Brad Varey
text = "This is a TEXT with QQ or Q or QQQ or some UPPER CASE WORDS to test a FUNCTION";
DeleteCases[
 StringTrim@StringCases[text,
   Longest[WordBoundary ~~ (_?UpperCaseQ | WhitespaceCharacter) .. ~~ WordBoundary]],
 ""]

This gives {"TEXT", "QQ", "Q", "QQQ", "UPPER CASE WORDS", "FUNCTION"}
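
The StringTrim and DeleteCases steps are needed because whitespace is allowed anywhere in the run, so a lone space between two lower-case words also matches. A possible variant, sketched here under the assumption that only the letters A-Z count as upper case, requires every run to start and end with an upper-case word, so no clean-up is needed. The names upperChar, upperWord and capsPhrase are made up for this illustration.

```wolfram
(* Illustrative variant, not from the original reply.
   Assumption: upper-case characters are exactly "A" .. "Z". *)
upperChar = Alternatives @@ CharacterRange["A", "Z"];
upperWord = Repeated[upperChar];

(* One upper-case word, optionally followed by more such words,
   each separated by whitespace; word boundaries on both ends keep
   the "T" of "This" from matching. *)
capsPhrase = WordBoundary ~~ upperWord ~~
   RepeatedNull[Whitespace ~~ upperWord] ~~ WordBoundary;

text = "This is a TEXT with QQ or Q or QQQ or some UPPER CASE WORDS to test a FUNCTION";
StringCases[text, Longest[capsPhrase]]
(* {"TEXT", "QQ", "Q", "QQQ", "UPPER CASE WORDS", "FUNCTION"} *)
```

To admit words such as R2D2 or NON-EXISTENT, the extra characters could be added to upperChar, although WordBoundary treats an apostrophe or hyphen as a boundary, so that extension would need testing.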

POSTED BY: l van Veen