Find email addresses in text using TextCases? "Recursion limit problem"

Windows 10, Mathematica 11.1

I'm mining through some email and would like to retrieve the email addresses. I wanted to use the convenient TextCases, but it generates a recursion limit problem. In a lot of Outlook mails you often see the following format:

"Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"

When I use TextCases

TextCases["Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>", "EmailAddress"]

RegularExpression::maxrec: Recursion limit exceeded; positive match might be missed.

It seems that the string before the < causes the problem when it contains an _ character, but the length of that string also seems to matter.

TextCases["abcdefg_hijklmnopqrst <Lab_Wolfram_Interest_Group@groups.wolfram.com>", "EmailAddress"]

generates an error.

The following seems ok:

TextCases["abcdefg_hijklmnopqrs <Lab_Wolfram_Interest_Group@groups.wolfram.com>", "EmailAddress"]

Any thoughts?

POSTED BY: l van Veen
Answer
1 year ago

I've looked at the code of TextCases a bit but can't say with 100% certainty what it does for the type "EmailAddress". However, I do see that some types simply get routed to Interpreter:

Interpreter[type][strings]

If you try this:

Interpreter["EmailAddress"]["Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"]
"Lab_Wolfram_Interest_Group@groups.wolfram.com"

it does work.

POSTED BY: Sander Huisman
Answer
1 year ago

If you follow the command with your input:

str="Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"
TextCases[str,"EmailAddress"]

calls:

TextCases[str,{"EmailAddress"}]

which calls:

NaturalLanguageProcessing`iTextCases[str,{"EmailAddress"}->"String"]

which calls:

TextPosition[str,{"EmailAddress"}]
POSTED BY: Sander Huisman
Answer
1 year ago

And after some deep digging, you get the string pattern:

NaturalLanguageProcessing`$TextPatternTable["EmailAddress"]
StringPosition[str, %]

How that is converted to regular expressions I'm not sure, I'll leave that puzzle to other people:

NaturalLanguageProcessing`$TextPatternTable["EmailAddress"]
StringPattern`PatternConvert[%]

{(?ms)(?:(?:[[:alnum:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}_%+\-])+\.{0,1})+@(?:[[:alnum:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}.\-])+\.(?:[[:alpha:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}]){2,4},{},{},Hold[None]}
POSTED BY: Sander Huisman
Answer
1 year ago

Would splitting the string be acceptable?

Flatten[TextCases[#, "EmailAddress"] & /@ StringSplit["Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>"], 1]
POSTED BY: Pedro Fonseca
Answer
1 year ago

This is a strange but annoying problem! I noticed you can do a stack trace. Is this the pattern used to find mail addresses?

[Image: screenshot of the stack trace]

Even when I use the standard StringCases, I run into the same recursion limit problem. I would think this must be a bug?

str = "Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>";
StringCases[str, ((WordCharacter | "_" | "%" | "+" | "-") .. ~~ Repeated[Verbatim["."], {0, 1}]) .. ~~ Verbatim["@"] ~~ (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~ Repeated[LetterCharacter, {2, 4}], Overlaps -> False]
POSTED BY: l van Veen
Answer
1 year ago

I didn't use the stack trace actually, that would be much easier haha. I did it in another way, anyhow, yes that is the email-pattern. The patterns for various things can be found here:

NaturalLanguageProcessing`$TextPatternTable // Keys // Sort // Column
NaturalLanguageProcessing`PackageScope`$WordSplitterPatterns // Keys // Sort // Column

both of them are associations with a bunch of patterns as values.

POSTED BY: Sander Huisman
Answer
1 year ago

Hi Sander, great to know! btw.. How do you know all this stuff? can't find it in any book. You should write one :)

POSTED BY: l van Veen
Answer
1 year ago

Well, I don't know these things by heart, but you can do:

Needs["GeneralUtilities`"]
PrintDefinitions@NaturalLanguageProcessing`TextPosition`PackagePrivate`iTextPosition

to see the internals of TextPosition. Click on the various links you will see in that document to go deeper into the code, and so on...

POSTED BY: Sander Huisman
Answer
1 year ago

It seems to me the pattern used is "wrong". When I change it to

f[x_] := StringCases[x, (WordCharacter | "_" | "%" | "+" | "-" | ".") .. ~~Verbatim["@"] ~~ (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~Repeated[LetterCharacter, {2, 4}], Overlaps -> False]

it seems to work like a charm.

POSTED BY: l van Veen
Answer
1 year ago

That would allow for email addresses starting with a . which is not allowed as far as I know...

POSTED BY: Sander Huisman
Answer
1 year ago

You're right. But the addresses have already been validated (otherwise I wouldn't have received mail from them). What if I force a word character first? That gives the same result and addresses your point.

f[x_] := StringCases[x, 
  WordCharacter ~~ (WordCharacter | "_" | "%" | "+" | "-" | ".") .. ~~
    Verbatim["@"] ~~ (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
    Repeated[LetterCharacter, {2, 4}], Overlaps -> False]
POSTED BY: l van Veen
Answer
1 year ago

Well, I think you can now still have multiple . in a row. I think the original string pattern (StringExpression) is correct; it is either the conversion to RegularExpression that goes wrong, or the evaluation of the regular expression... I would indeed call this a bug. Do you mind sending it in as product feedback?
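
For illustration only, here is a minimal sketch of a pattern that keeps the "no leading dot, no consecutive dots" rule without nesting one repetition inside another. The names localRun and emailPattern are hypothetical and not part of TextCases. Unlike the built-in pattern, this variant also rejects a dot directly before the @. It is assumed, not verified here, that this form avoids the RegularExpression::maxrec message on version 11.1.

```wolfram
(* Sketch only: localRun and emailPattern are made-up names for illustration. *)
str = "Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>";

(* One or more allowed local-part characters; the dot is deliberately excluded *)
localRun = Repeated[WordCharacter | "_" | "%" | "+" | "-"];

(* Local part: a run, then zero or more groups of "." followed by another run.
   Each character can be consumed in only one way, so there is no nested
   repetition for the regex engine to backtrack through. *)
emailPattern = localRun ~~ RepeatedNull["." ~~ localRun] ~~ "@" ~~
   Repeated[WordCharacter | "." | "-"] ~~ "." ~~
   Repeated[LetterCharacter, {2, 4}];

StringCases[str, emailPattern, Overlaps -> False]
```

The domain part is copied unchanged from the patterns posted earlier in this thread, so it still accepts consecutive dots after the @.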

POSTED BY: Sander Huisman
Answer
1 year ago

Hi Sander, I submitted a case. This is in my opinion a bug indeed. I'll post the outcome.

POSTED BY: l van Veen
Answer
1 year ago

I guess, there should be no recursion problem for any regular expression without function-calls in them...

POSTED BY: Sander Huisman
Answer
1 year ago

Bug is solved in version 11.3

POSTED BY: l van Veen
Answer
26 days ago
