# Find email addresses in text using TextCases? "Recursion limit problem"

Windows 10, MM 11.1

I'm mining through some email and would like to retrieve the email addresses. I wanted to use the easy TextCases, but it generates a recursion limit problem. With a lot of Outlook mails you often see the following format:

    "Lab_Wolfram_Interest_Group "

When I use TextCases:

    TextCases["Lab_Wolfram_Interest_Group ", "EmailAddress"]

    RegularExpression::maxrec: Recursion limit exceeded; positive match might be missed.

It seems that the string before the < creates the problem when it contains an _ character, but the length of the string also seems important. This generates an error:

    TextCases["abcdefg_hijklmnopqrst ", "EmailAddress"]

The following seems OK:

    TextCases["abcdefg_hijklmnopqrs ", "EmailAddress"]

Any thoughts?
Posted 1 year ago
I've looked at the code of TextCases a bit but can't say with 100% certainty what it does for the type "EmailAddress". However, I do see that some types simply get routed to Interpreter:

    Interpreter[type][strings]

If you try this:

    Interpreter["EmailAddress"]["Lab_Wolfram_Interest_Group "]

    "Lab_Wolfram_Interest_Group@groups.wolfram.com"

it does work.
Posted 1 year ago
If you follow the command with your input:

    str = "Lab_Wolfram_Interest_Group "
    TextCases[str, "EmailAddress"]

calls:

    TextCases[str, {"EmailAddress"}]

which calls:

    NaturalLanguageProcessing`iTextCases[str, {"EmailAddress"} -> "String"]

which calls:

    TextPosition[str, {"EmailAddress"}]
Posted 1 year ago
And after some deep digging, you get the string pattern:

    NaturalLanguageProcessing`$TextPatternTable["EmailAddress"]
    StringPosition[str, %]

How that is converted to a regular expression I'm not sure; I'll leave that puzzle to other people:

    NaturalLanguageProcessing`$TextPatternTable["EmailAddress"]
    StringPattern`PatternConvert[%]

    {(?ms)(?:(?:[[:alnum:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}_%+\-])+\.{0,1})+@(?:[[:alnum:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}.\-])+\.(?:[[:alpha:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}]){2,4},{},{},Hold[None]}
Posted 1 year ago
Would splitting the string be acceptable?

    Flatten[TextCases[#, "EmailAddress"] & /@ StringSplit["Lab_Wolfram_Interest_Group "], 1]
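A variation on the splitting idea, as a minimal sketch: rather than splitting on whitespace, pull out only the text between angle brackets and check each piece on its own. The input string below is an assumption (a display name followed by the address in angle brackets, which is the format the question describes), the names `candidates` and `addr` are made up for illustration, and whether this avoids the recursion message in 11.1 is untested.

```mathematica
(* Assumed input: display name followed by the address in angle brackets. *)
str = "Lab_Wolfram_Interest_Group <Lab_Wolfram_Interest_Group@groups.wolfram.com>";

(* Keep only the text between "<" and ">". *)
candidates = StringCases[str, "<" ~~ Shortest[addr__] ~~ ">" :> addr];

(* Run the email check on each short candidate separately. *)
Flatten[TextCases[#, "EmailAddress"] & /@ candidates]
```

This only finds addresses that are wrapped in angle brackets, so bare addresses in running text would still need the whitespace split.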
Posted 1 year ago
This is a strange but annoying problem! I noticed you can do a stack trace. Is this the pattern used to find mail addresses? Even when I use the standard StringCases with it, I run into the same max recursion problem. I would think this must be a bug..?

    str = "Lab_Wolfram_Interest_Group ";
    StringCases[str,
     ((WordCharacter | "_" | "%" | "+" | "-") .. ~~ Repeated[Verbatim["."], {0, 1}]) .. ~~
      Verbatim["@"] ~~ (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
      Repeated[LetterCharacter, {2, 4}], Overlaps -> False]
Posted 1 year ago
I didn't use the stack trace actually; that would have been much easier, haha. I did it another way. Anyhow, yes, that is the email pattern. The patterns for various things can be found here:

    NaturalLanguageProcessing`$TextPatternTable // Keys // Sort // Column
    NaturalLanguageProcessing`PackageScope`$WordSplitterPatterns // Keys // Sort // Column

Both of them are associations with a bunch of patterns as values.
Posted 1 year ago
Hi Sander, great to know! By the way, how do you know all this stuff? I can't find it in any book. You should write one :)
Posted 1 year ago
Well, I don't know these things by heart, but you can do:

    Needs["GeneralUtilities`"]
    PrintDefinitions@NaturalLanguageProcessing`TextPosition`PackagePrivate`iTextPosition

to see the internals of TextPosition, then click on the various links you will see in that document to go deeper into the code, and so on...
Posted 1 year ago
It seems to me the pattern used is "wrong". When I change it to

    f[x_] := StringCases[x,
      (WordCharacter | "_" | "%" | "+" | "-" | ".") .. ~~ Verbatim["@"] ~~
       (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
       Repeated[LetterCharacter, {2, 4}], Overlaps -> False]

it seems to work like a charm.
Posted 1 year ago
 That would allow for email addresses starting with a . which is not allowed as far as I know...
Posted 1 year ago
You're right. But the addresses have already been evaluated (otherwise I wouldn't have received them). What if I force a word character first? It gives the same result but resolves your point.

    f[x_] := StringCases[x,
      WordCharacter ~~ (WordCharacter | "_" | "%" | "+" | "-" | ".") .. ~~ Verbatim["@"] ~~
       (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
       Repeated[LetterCharacter, {2, 4}], Overlaps -> False]
Posted 1 year ago
Well, I think you can now still have multiple . in a row. I think the original string pattern (StringExpression) is correct; it is either the conversion to RegularExpression that goes wrong, or the evaluation of the regular expression... I would call this a bug indeed. Do you mind sending it in as product feedback?
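For what it's worth, here is a sketch of a pattern that keeps both restrictions (no leading dot, no two dots in a row) without putting an optional dot inside a repeated group. The names `atom`, `localPart` and `emailPattern` are made up for illustration, the address in `str` is a placeholder rather than one from this thread, and I have not checked this against 11.1. Unlike the built-in pattern, it also rejects a dot directly before the @.

```mathematica
(* Assumed example input; the address is a placeholder, not from the thread. *)
str = "abcdefg_hijklmnopqrst <someone@example.com>";

(* One run of allowed local-part characters, without dots. *)
atom = (WordCharacter | "_" | "%" | "+" | "-") ..;

(* Runs joined by single dots: no leading dot, no two dots in a row. *)
localPart = atom ~~ ("." ~~ atom) ...;

emailPattern = localPart ~~ "@" ~~ (WordCharacter | "." | "-") .. ~~ "." ~~
   Repeated[LetterCharacter, {2, 4}];

StringCases[str, emailPattern, Overlaps -> False]
```

Because every repetition after the first run has to begin with a literal dot, a run of letters and underscores can only be divided up in one way. With the nested optional dot, the same run can be divided in many ways, which is a common source of heavy backtracking in regex engines; whether that is what triggers the maxrec message here is a guess on my part.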
Posted 1 year ago
 Hi Sander, I submitted a case. This is in my opinion a bug indeed. I'll post the outcome.