# Find email addresses in text using TextCases? "Recursion limit problem"

Posted 1 year ago | 2188 Views | 15 Replies | 13 Total Likes
Windows 10, MM 11.1

I'm mining through some email and would like to retrieve the email addresses. I wanted to use the easy TextCases, but it generates a recursion limit problem. With a lot of Outlook mails you often see the following format:

    "Lab_Wolfram_Interest_Group "

When I use TextCases:

    TextCases["Lab_Wolfram_Interest_Group ", "EmailAddress"]

I get:

    RegularExpression::maxrec: Recursion limit exceeded; positive match might be missed.

It seems that the string before the < creates the problem when it contains an _ character, but the length of the string also seems to matter. This generates the error:

    TextCases["abcdefg_hijklmnopqrst ", "EmailAddress"]

The following seems OK:

    TextCases["abcdefg_hijklmnopqrs ", "EmailAddress"]

Any thoughts?
15 Replies
Posted 1 year ago
I've looked at the code of TextCases a bit but can't say with 100% certainty what it does for the type "EmailAddress". I do see, however, that some types simply get routed to Interpreter:

    Interpreter[type][strings]

If you try this:

    Interpreter["EmailAddress"]["Lab_Wolfram_Interest_Group "]

    "Lab_Wolfram_Interest_Group@groups.wolfram.com"

it does work.
Posted 1 year ago
If you follow the command with your input:

    str = "Lab_Wolfram_Interest_Group "
    TextCases[str, "EmailAddress"]

calls:

    TextCases[str, {"EmailAddress"}]

which calls:

    NaturalLanguageProcessing`iTextCases[str, {"EmailAddress"} -> "String"]

which calls:

    TextPosition[str, {"EmailAddress"}]
Posted 1 year ago
And after some deep digging, you get the string pattern:

    NaturalLanguageProcessing`$TextPatternTable["EmailAddress"]
    StringPosition[str, %]

How that is converted to regular expressions I'm not sure; I'll leave that puzzle to other people:

    NaturalLanguageProcessing`$TextPatternTable["EmailAddress"]
    StringPattern`PatternConvert[%]

    {(?ms)(?:(?:[[:alnum:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}_%+\-])+\.{0,1})+@(?:[[:alnum:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}.\-])+\.(?:[[:alpha:]\x{f6b2}-\x{f6b5}\x{f6b7}\x{f6b9}-\x{f6bc}\x{f6be}-\x{f6bf}\x{f6c1}-\x{f700}\x{f730}-\x{f731}\x{f770}\x{f772}-\x{f773}\x{f776}\x{f779}-\x{f77a}\x{f77d}-\x{f780}\x{f782}-\x{f78b}\x{f78d}-\x{f790}\x{f793}-\x{f79a}\x{f79c}-\x{f7a2}\x{f7a4}-\x{f7bd}\x{f800}-\x{f844}\x{f846}-\x{f84c}\x{f854}-\x{f86c}\x{f874}-\x{f875}\x{f878}-\x{f879}\x{f87d}-\x{f886}\x{f88a}]){2,4},{},{},Hold[None]}
Posted 1 year ago
Would splitting the string be acceptable?

    Flatten[TextCases[#, "EmailAddress"] & /@ StringSplit["Lab_Wolfram_Interest_Group "], 1]
Posted 1 year ago
This is a strange but annoying problem! I noticed you can do a stack trace. Is this the pattern used to find mail addresses? Even when I use the standard StringCases I run into the same max recursion problem. I would think this must be a bug?

    str = "Lab_Wolfram_Interest_Group ";
    StringCases[str,
     ((WordCharacter | "_" | "%" | "+" | "-") .. ~~ Repeated[Verbatim["."], {0, 1}]) .. ~~
      Verbatim["@"] ~~ (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
      Repeated[LetterCharacter, {2, 4}], Overlaps -> False]
Posted 1 year ago
I didn't use the stack trace actually; that would have been much easier, haha. I did it another way. Anyhow, yes, that is the email pattern. The patterns for various things can be found here:

    NaturalLanguageProcessing`$TextPatternTable // Keys // Sort // Column
    NaturalLanguageProcessing`PackageScope`$WordSplitterPatterns // Keys // Sort // Column

Both of them are associations with a bunch of patterns as values.
Posted 1 year ago
Hi Sander, great to know! By the way, how do you know all this stuff? I can't find it in any book. You should write one :)
Posted 1 year ago
Well, I don't know these things by heart, but you can do:

    Needs["GeneralUtilities`"]
    PrintDefinitions@NaturalLanguageProcessing`TextPosition`PackagePrivate`iTextPosition

to see the internals of TextPosition. Click on the various links you will see in that document to go deeper into the code, and so on...
Posted 1 year ago
It seems to me the pattern used is "wrong". When I change it to

    f[x_] := StringCases[x,
      (WordCharacter | "_" | "%" | "+" | "-" | ".") .. ~~ Verbatim["@"] ~~
       (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
       Repeated[LetterCharacter, {2, 4}], Overlaps -> False]

it seems to work like a charm.
Posted 1 year ago
 That would allow for email addresses starting with a . which is not allowed as far as I know...
Posted 1 year ago
You're right. But the addresses have already been validated (otherwise I wouldn't have received them). What if I force a word character first? It gives the same result but resolves your point.

    f[x_] := StringCases[x,
      WordCharacter ~~ (WordCharacter | "_" | "%" | "+" | "-" | ".") .. ~~ Verbatim["@"] ~~
       (WordCharacter | "." | "-") .. ~~ Verbatim["."] ~~
       Repeated[LetterCharacter, {2, 4}], Overlaps -> False]
Posted 1 year ago
 Well, I think now you can still have multiple . in a row; I think the original stringpattern (stringexpression) is correct, it is either the conversion to RegularExpression that goes wrong, or the evaluation of the regularexpression... I would call this a bug indeed, do you mind sending it in as product feedback?
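On the point about consecutive dots, here is a minimal sketch of a string pattern that rules out both a leading dot and two dots in a row, without putting an optional dot inside a repeated group. It assumes that this nested repetition is what triggers the recursion limit, which nothing in this thread confirms, and it is untested on 11.1. The names localChar, domainChar, emailPattern and findEmails are made up for illustration and are not part of TextCases.

```wolfram
(* Hypothetical helper names; none of these are built in or used by TextCases. *)

(* Characters allowed in the local part (before the @), excluding the dot. *)
localChar = WordCharacter | "_" | "%" | "+" | "-";

(* Characters allowed in the domain part, as in the patterns above. *)
domainChar = WordCharacter | "." | "-";

(* One or more local characters, then zero or more groups of
   "a single dot followed by one or more local characters".
   Every dot must be followed by a non-dot character, so a leading dot
   and consecutive dots cannot match. *)
emailPattern =
  localChar .. ~~ ("." ~~ localChar ..) ... ~~ "@" ~~
   domainChar .. ~~ "." ~~ Repeated[LetterCharacter, {2, 4}];

findEmails[text_String] := StringCases[text, emailPattern, Overlaps -> False]
```

For example, findEmails["contact: john.doe@example.com today"] should return {"john.doe@example.com"}. Unlike the built-in pattern, this sketch also rejects a dot directly before the @.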
Posted 1 year ago
 Hi Sander, I submitted a case. This is in my opinion a bug indeed. I'll post the outcome.