Find email addresses in text using TextCases? "Recursion limit problem"

Posted 1 year ago

Windows 10, Mathematica 11.1

I'm mining through some email and would like to retrieve the email addresses. I wanted to use the easy TextCases, but it generates a recursion limit problem. In Outlook mail you often see the following format:

"Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"

When I use TextCases

TextCases["Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>", "EmailAddress"]

RegularExpression::maxrec: Recursion limit exceeded; positive match might be missed.

The string before the < seems to cause the problem when it contains an _ character, but the length of that string also seems to matter.

TextCases["abcdefg_hijklmnopqrst <Lab_Wolfram_Interest_Group@groups.wolfram.com>", "EmailAddress"]

generates an error.

The following seems ok:

TextCases["abcdefg_hijklmnopqrs <Lab_Wolfram_Interest_Group@groups.wolfram.com>", "EmailAddress"]
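To pin down the length at which the message first appears, a small probe along these lines might help. This is only a sketch: hitsLimitQ is a made-up helper name, the "a_" prefix and the filler letter are arbitrary choices, and the threshold it reports may well differ between versions and inputs.

```wolfram
(* Hypothetical helper: True if TextCases issues RegularExpression::maxrec
   when the text before "<" is "a_" followed by n filler letters *)
hitsLimitQ[n_Integer?Positive] := Module[{str},
  str = "a_" <> StringRepeat["b", n] <> " <Lab_Wolfram_Interest_Group@groups.wolfram.com>";
  (* Check returns True only if the named message was generated *)
  Check[TextCases[str, "EmailAddress"]; False, True, RegularExpression::maxrec]
]

(* Prefix lengths in this range that trigger the message, if any *)
Select[Range[10, 30], hitsLimitQ]
```

On a version where the problem does not occur, the Select call should simply return an empty list.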

Any thoughts?

15 Replies

I've looked at the code of TextCases a bit but can't say with 100% certainty what it does for the type "EmailAddress". I do see, however, that some types simply get routed to Interpreter:

Interpreter[type][strings]

If you try this:

Interpreter["EmailAddress"]["Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"]
"Lab_Wolfram_Interest_Group@groups.wolfram.com"

it does work.

If you trace the evaluation with your input:

str="Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"
TextCases[str,"EmailAddress"]

calls:

TextCases[str,{"EmailAddress"}]

which calls:

NaturalLanguageProcessing`iTextCases[str,{"EmailAddress"}->"String"]

which calls:

TextPosition[str,{"EmailAddress"}]

And after some deep digging, you get the string pattern:

NaturalLanguageProcessing`$TextPatternTable["EmailAddress"]
StringPosition[str, %]

I'm not sure how that is converted to a regular expression; I'll leave that puzzle to others:

NaturalLanguageProcessing`$TextPatternTable["EmailAddress"]
StringPattern`PatternConvert[%]

{(?ms)(?:(?:[[:alnum:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}_%+\-])+\.{0,1})+@(?:[[:alnum:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}.\-])+\.(?:[[:alpha:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}]){2,4},{},{},Hold[None]}

Would splitting the string be acceptable?

Flatten[TextCases[#, "EmailAddress"] & /@ StringSplit["Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"], 1]
Posted 1 year ago

This is a strange but annoying problem! I noticed you can do a stack trace. Is this the pattern used to find mail addresses?

Even when I use the standard StringCases with that pattern, I run into the same maximum recursion problem. I would think this must be a bug?

str = "Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>";
StringCases[str,
  ((WordCharacter | "_" | "%" | "+" | "-") .. ~~ Repeated[Verbatim["."], {0, 1}]) .. ~~
    Verbatim["@"] ~~ (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
    Repeated[LetterCharacter, {2, 4}], Overlaps -> False]

I didn't actually use the stack trace; that would have been much easier, haha. I did it another way. Anyhow, yes, that is the email pattern. The patterns for various things can be found here:

NaturalLanguageProcessing`$TextPatternTable // Keys // Sort // Column
NaturalLanguageProcessing`PackageScope`$WordSplitterPatterns // Keys // Sort // Column

Both of them are associations with a bunch of patterns as values.

Posted 1 year ago

Hi Sander, great to know! By the way, how do you know all this stuff? I can't find it in any book. You should write one :)

Well, I don't know these things by heart, but you can do:

Needs["GeneralUtilities`"]
PrintDefinitions@NaturalLanguageProcessing`TextPosition`PackagePrivate`iTextPosition

to see the internals of TextPosition. Click on the various links you see in that document to go deeper into the code, and so on...

Posted 1 year ago

It seems to me the pattern used is "wrong". When I change it to

f[x_] := StringCases[x, (WordCharacter | "_" | "%" | "+" | "-" | ".") .. ~~Verbatim["@"] ~~ (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~Repeated[LetterCharacter, {2, 4}], Overlaps -> False]

it seems to work like a charm.

That would allow for email addresses starting with a . which is not allowed as far as I know...

Posted 1 year ago

You're right. But the addresses have already been validated (otherwise I wouldn't have received them). What if I force a word character first? That gives the same result and resolves your point.

f[x_] := StringCases[x, 
  WordCharacter ~~ (WordCharacter | "_" | "%" | "+" | "-" | ".") .. ~~
    Verbatim["@"] ~~ (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
    Repeated[LetterCharacter, {2, 4}], Overlaps -> False]

Well, I think that now you can still have multiple . in a row. I think the original string pattern (StringExpression) is correct; it is either the conversion to a regular expression that goes wrong, or the evaluation of the regular expression. I would indeed call this a bug. Do you mind sending it in as product feedback?
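One possible way to rule out a leading dot and consecutive dots at the same time is to write the local part as runs of allowed characters separated by single dots. The following is a sketch, not the built-in pattern: the names atom, localPart, domain, emailPattern and findEmails are made up for illustration, it is slightly stricter than the original (no dot directly before the @), and whether it sidesteps the maxrec message on affected versions is untested.

```wolfram
(* Sketch only: these names are illustrative, not part of TextCases *)
(* A run of one or more allowed local-part characters (no dot) *)
atom = Repeated[WordCharacter | "_" | "%" | "+" | "-"];

(* One run, then zero or more "." + run groups:
   no leading, trailing, or doubled dots *)
localPart = atom ~~ RepeatedNull["." ~~ atom];

(* One or more dot-terminated labels, then a 2 to 4 letter top-level domain,
   mirroring the {2, 4} limit of the original pattern *)
domain = Repeated[Repeated[WordCharacter | "-"] ~~ "."] ~~
   Repeated[LetterCharacter, {2, 4}];

emailPattern = localPart ~~ "@" ~~ domain;

findEmails[s_String] := StringCases[s, emailPattern, Overlaps -> False]

findEmails["Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"]
```

Because every repeated group here must consume a literal dot, there is only one way to split a given run of characters between repetitions, which is the main design difference from the nested form of the original pattern.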

Posted 1 year ago

Hi Sander, I submitted a case. This is in my opinion a bug indeed. I'll post the outcome.

I guess, there should be no recursion problem for any regular expression without function-calls in them...

Posted 4 months ago

The bug is fixed in version 11.3.
