Automatically-generated timelines

Inspired by the recent Wolfram Blog post, I created a program that automatically generates timelines based on mentions of notable events in Wikipedia articles. Try it out below:

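generateAutoTimeline[text_String] := Module[{yearsentences},
  (* keep only sentences that mention "in <four-digit year>" *)
  yearsentences = Select[TextSentences[text], StringMatchQ[#, 
      RegularExpression[".*?(?:I|i)n (\\d{4}).*?\\."]] &]; 
  (* assumed wrapper: label each year with its sentence, then plot *)
  TimelinePlot[Labeled[DateObject[{FromDigits[
         StringCases[#, RegularExpression["\\d{4}"]][[1]]]}], #] & /@ 
    yearsentences]]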

generateAutoTimeline[WikipediaData["Natural language processing"]]

Timeline of notable events in natural language processing

POSTED BY: Jesse Friedman
6 Replies

I just recently used this again, such a useful function! I needed to scan a long historical Wikipedia article and focus on dates only. This did the whole job for me and worked like a charm. Two minor suggestions:

  • to inherit options of TimelinePlot, something like PlotLayout -> "Vertical" would work;
  • to add to Wolfram Function Repository :-)
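The first suggestion could be sketched roughly as follows. The name generateAutoTimelineWithOptions is hypothetical, and the body is only an assumption about how the original function is put together; the point is the option forwarding via OptionsPattern.

```wolfram
(* Hypothetical variant: accept any TimelinePlot option and pass it through *)
Options[generateAutoTimelineWithOptions] = Options[TimelinePlot];

generateAutoTimelineWithOptions[text_String, opts : OptionsPattern[]] :=
 Module[{yearsentences},
  (* sentences that mention "in <four-digit year>" *)
  yearsentences = Select[TextSentences[text],
    StringMatchQ[#, RegularExpression[".*?(?:I|i)n (\\d{4}).*?\\."]] &];
  (* label each year with its sentence and forward the options *)
  TimelinePlot[
   Labeled[
      DateObject[{FromDigits[
         StringCases[#, RegularExpression["\\d{4}"]][[1]]]}], #] & /@
    yearsentences,
   opts]]

(* small made-up input so the call does not depend on WikipediaData *)
generateAutoTimelineWithOptions[
 "In 1950, Alan Turing proposed the imitation game. In 1966, ELIZA was described.",
 PlotLayout -> "Vertical"]
```

Setting Options on the new function to Options[TimelinePlot] is what lets OptionsPattern[] accept PlotLayout and the other plot options without complaint.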

Thanks, Jesse, this idea already saved me a lot of time.

POSTED BY: Vitaliy Kaurov

You have earned a Featured Contributor Badge. Your exceptional post has been selected for our editorial column Staff Picks, and your profile is now distinguished by a Featured Contributor Badge and displayed on the Featured Contributor Board. Thank you!

POSTED BY: Moderation Team

Very cool idea! And it works for topics you might not expect it to be useful for:

frog timeline screenshot

This is a very neat idea, Jesse. Do you mean that

Select[TextSentences[nlp], StringMatchQ[#, RegularExpression[".*?(?:I|i)n (\\d{4}).*?\\."]] &]

is more robust at finding sentences with year dates than the method given in the blog post?

TextCases[nlp, Containing["Sentence", "Number"]]

I am not that familiar with regex. Could you explain briefly how to read

RegularExpression[".*?(?:I|i)n (\\d{4}).*?\\."]
POSTED BY: Sam Carrettie

I started out with TextCases, but (at least for me) it runs really slowly on even a relatively small text, like the Wikipedia article. I think this is because it has to connect to the cloud to use semantic interpretations of numbers like "two thousand and four." For me, regex is much, much faster and more adaptable.

Here's a breakdown of the regex:

  • .*? means "match any run of characters, but as few as necessary." It should work with just ".*"; the lazy quantifier is a holdover from when I was fine-tuning the regex.
  • (?:I|i) means "match either capital I or lowercase i." The "?:" is just a formality, preventing the creation of a capture group.
  • character n
  • character [space]
  • (\\d{4}) means "match four digits." The actual code for a digit is "\d", but it has to be escaped in a Wolfram Language string.
  • .*? again
  • \\. means "match the character [period]." The period has to be escaped because it's a special regex character, and so does the backslash, since otherwise the Wolfram Language will think I want to insert a special character.
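To see the pieces working together, here is a small sketch that applies the regex to a few made-up sentences (the sample strings are illustrative only, not taken from the Wikipedia article):

```wolfram
re = RegularExpression[".*?(?:I|i)n (\\d{4}).*?\\."];

(* StringMatchQ tests the whole sentence, as in the Select filter *)
StringMatchQ["In 1950, Alan Turing proposed the imitation game.", re]  (* True *)
StringMatchQ["Alan Turing proposed the imitation game.", re]           (* False *)

(* "$1" refers to the parenthesized group, i.e. the four digits *)
StringCases["ELIZA was described in 1966.", re -> "$1"]  (* {"1966"} *)
```

The second sentence fails because the only "in" it contains (inside "Turing") is not followed by a space and four digits.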

I could probably get away with just ".*(I|i)n \d{4}.*" for the regex, but I needed the other parts for previous iterations of the code and never bothered to take them out.
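A quick way to check that claim, as a sketch on made-up sentences: inside a Wolfram Language string the backslash still has to be doubled, and the shorter regex no longer requires a closing period, so the two can disagree on text that lacks one.

```wolfram
full = RegularExpression[".*?(?:I|i)n (\\d{4}).*?\\."];
simple = RegularExpression[".*(I|i)n \\d{4}.*"];

samples = {
   "In 1950, Alan Turing proposed the imitation game.",
   "ELIZA was described in 1966.",
   "Alan Turing proposed the imitation game.",
   "ELIZA was described in 1966"  (* no closing period *)
   };

StringMatchQ[samples, full]    (* {True, True, False, False} *)
StringMatchQ[samples, simple]  (* {True, True, False, True} *)
```

On sentences returned by TextSentences, which normally end in punctuation, the difference should rarely matter.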

A Wolfram Language pattern translation of the core of the above is ("I" | "i") ~~ "n " ~~ Repeated[DigitCharacter, {4}].
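A string pattern along these lines can also pull the years out directly. This is a sketch on a made-up sentence; it includes the space after "n" that the regex requires, and the pattern name year is just an illustrative choice.

```wolfram
(* name the four digits so they can be returned on the right-hand side *)
StringCases[
 "In 1950, Alan Turing proposed the imitation game, and in 1966 ELIZA was described.",
 ("I" | "i") ~~ "n " ~~ (year : Repeated[DigitCharacter, {4}]) :> year]
(* {"1950", "1966"} *)

(* the same pattern as a sentence filter; no leading or trailing wildcard is needed *)
StringContainsQ["ELIZA was described in 1966.",
 ("I" | "i") ~~ "n " ~~ Repeated[DigitCharacter, {4}]]  (* True *)
```

Unlike StringMatchQ, StringContainsQ only needs the pattern to occur somewhere in the sentence, which is why the surrounding ".*" parts of the regex have no counterpart here.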

POSTED BY: Jesse Friedman

Thanks, Jesse, very instructive!

POSTED BY: Vitaliy Kaurov