I started out with TextCases, but (at least for me) it runs really slowly on even a relatively small text, like the Wikipedia article. I think this is because it has to connect to the cloud to use semantic interpretations of numbers like "two thousand and four." For me, regex is much, much faster and more adaptable.
Here's a breakdown of the regex:
- .*? means "match any characters, but as few as necessary." It should work with just ".*"; the lazy quantifier is a holdover from when I was fine-tuning the regex.
- (?:I|i) means "match either capital I or lowercase i." The "?:" is just a formality, preventing the creation of a capture group.
- character n
- character [space]
- (\\d{4}) means "match four digits." The actual code for a digit is "\d", but the backslash has to be escaped in a Wolfram Language string. The parentheses make the four digits a capture group.
- .*? again
- \\. means "match the character [period]." The period has to be escaped because it's a regex metacharacter, and the backslash has to be escaped as well, since otherwise the Wolfram Language will think I want to insert a special character.
I could probably get away with just ".*(I|i)n \\d{4}.*" for the regex, but I needed the other parts for previous iterations of the code and never bothered to take them out.
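To make the breakdown concrete, here is a minimal sketch of the regex in use. The sample sentence and the variable name text are made up for illustration, and the second call is just one possible way to pull out the captured year, not necessarily how the original code did it.

```wolfram
(* Hypothetical sample text, not taken from the Wikipedia article *)
text = "He was born in 1975. In 2004 the company moved to Boston. Nothing happened in May.";

(* The full regex assembled from the pieces above: returns each whole match *)
StringCases[text, RegularExpression[".*?(?:I|i)n (\\d{4}).*?\\."]]

(* "$1" refers to the capture group (\\d{4}), so only the years come back *)
StringCases[text, RegularExpression["(?:I|i)n (\\d{4})"] -> "$1"]
(* {"1975", "2004"} *)
```

The third sentence produces no match because "in May" is not followed by four digits.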
A Wolfram Language pattern translation of the core of the above (the "In"/"in", the space, and the four digits) is ("I" | "i") ~~ "n " ~~ Repeated[DigitCharacter, {4}].
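As a rough sketch, the string-pattern form can be used with StringCases in the same way as the regex. It is written here with explicit parentheses and with the space after "n" included, to mirror the regex. The sample text and the pattern name year are assumptions for illustration only.

```wolfram
(* Hypothetical sample text *)
text = "He was born in 1975. In 2004 the company moved to Boston.";

(* Name the four digits "year" and return just that part of each match *)
StringCases[text,
  ("I" | "i") ~~ "n " ~~ (year : Repeated[DigitCharacter, {4}]) :> year]
(* {"1975", "2004"} *)
```

Naming the digits plays the same role as the capture group in the regex version.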