Abstract
Chronological dating is essential for tasks such as summarization and document retrieval. This project proposes a novel method of dating historical texts using a recurrent neural network that operates on both the character level and the word level. The results show a significant improvement in accuracy compared to a word-level-only RNN, with prediction errors ranging from one year to a century in most cases. While the model performs reasonably well on texts from the 19th century, accuracy declines significantly for older texts owing to their scarcity and the non-homogeneous distribution of the dataset.
Data Collection and Pre-processing
The training data consists of public-domain books collected from Open Library, an online project created by Aaron Swartz, Brewster Kahle and others. The Wolfram Language provides ServiceConnect, which allows direct interaction with the Open Library API.
Data Collection
In[106]:= openlibrary = ServiceConnect["OpenLibrary"]
Out[106]= ServiceObject["OpenLibrary",
"ID" -> "connection-1f55c291dcb5feaa290ece0cd1c97ed2"]
In[107]:= BookDatabase = <||>;
nbCalls = 0;
In[109]:= GetTextRequest[keys_] := (nbCalls++;
  Normal@openlibrary["BookText", {"BibKeys" -> {keys}}])
In[110]:= GetValidTextKey[keys_] :=
SelectFirst[keys,
MatchQ[Pause[.1]; Normal@openlibrary["BookText", {"BibKeys" -> {#}}],
KeyValuePattern[_ -> _String]] &];
GetFirstText[list_] := FirstCase[list, Except["NotAvailable", _String]]
In[112]:= GetTexts[keys_] :=
Quiet[GetFirstText[
Values[Normal@
openlibrary["BookText", {"BibKeys" -> RandomSample[keys, UpTo[50]]}]]]]
In[113]:= AddBook[b_] :=
BookDatabase[b["FirstPublishYear"]] =
If[MatchQ[BookDatabase[b["FirstPublishYear"]], _Missing],
{GetTexts[b["EditionKey"]]},
Append[BookDatabase[b["FirstPublishYear"]], GetTexts[b["EditionKey"]]]
]
In[114]:= AddSubject[subj_String] :=
Module[{searchResults},
(*Database init*)
BookDatabase = <||>;
(*Searching books*)
searchResults =
Select[Normal@
openlibrary["BookSearch", {"Subject" -> subj, MaxItems -> 90}],
#["HasFulltext"] &];
(*Downloading Text*)
GeneralUtilities`MonitoredMap[AddBook, searchResults];
Print[subj <> " DOWNLOADED!"];
(*Exporting*)
Export["C:\\Users\\Tarek\\OneDrive\\Documents\\Portfolio\\opportunities\\\
Wolfram Summer Camp\\Dating Historical Texts\\" <> subj <> ".wxf",
BookDatabase];
Pause[180];
]
(*TESTING*)
In[115]:= AddSubject /@ {"Religion", "Games", "Drama", "Action", "Adventure", "Horror",
"Spirituality", "Poetry", "Fantasy"}
During evaluation of In[115]:= Religion DOWNLOADED!
During evaluation of In[115]:= Games DOWNLOADED!
During evaluation of In[115]:= Drama DOWNLOADED!
During evaluation of In[115]:= Action DOWNLOADED!
During evaluation of In[115]:= Adventure DOWNLOADED!
During evaluation of In[115]:= Horror DOWNLOADED!
During evaluation of In[115]:= Spirituality DOWNLOADED!
During evaluation of In[115]:= Poetry DOWNLOADED!
During evaluation of In[115]:= Fantasy DOWNLOADED!
Training the Neural Net
The project uses a hybrid word-level and character-level recurrent neural network. The word-level branch is built on the GloVe model, which supplies vector representations for words. The limitation of a word-level-only network is that many old books contain words absent from GloVe's training data; adding a character-level chain improves prediction accuracy because it can process previously unseen vocabulary.
Define the net
In[24]:= net = NetGraph[<|
"chars" -> {
UnitVectorLayer[],
LongShortTermMemoryLayer[50],
DropoutLayer[.5]
},
"words" -> {
NetModel[
"GloVe 100-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 \
Data"],
LongShortTermMemoryLayer[50],
SequenceLastLayer[],
DropoutLayer[.5]
},
"cat" -> CatenateLayer[],
"predict" -> {
LongShortTermMemoryLayer[100],
SequenceLastLayer[],
DropoutLayer[.5],
LinearLayer[1]
}
|>,
{
NetPort["Characters"] -> "chars",
NetPort["Words"] -> "words",
{"chars", "words"} -> "cat",
"cat" -> "predict" -> NetPort["Date"]
},
"Characters" -> NetEncoder[{"Characters", characters}],
"Date" -> NetDecoder["Scalar"]
];
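The "Characters" encoder above references a characters symbol that was defined during pre-processing and is not shown in this section. A minimal sketch of how such an alphabet, together with the charPreprocess helper used below, might look (these are assumed reconstructions, not the original definitions):

(*ASSUMED reconstruction: restrict texts to a fixed alphabet so the "Characters" NetEncoder sees a known character set*)
characters = Join[CharacterRange["a", "z"], {" ", ".", ",", ";", "'", "-"}];
charPreprocess[text_String] :=
 StringReplace[ToLowerCase[text], Except[Alternatives @@ characters] -> " "]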
Create training data
In[32]:= sample[text_String, n_: 1024] :=
 Module[{len, offset},
  len = StringLength@text;
  (*clamp so texts shorter than n do not produce an invalid random range*)
  offset = RandomInteger[{1, Max[1, len - n]}];
  StringPadRight[
   charPreprocess@
    StringTake[text, {offset, Min[len, offset + n - 1]}], n]
  ];
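Under the charPreprocess sketch above, sample can be checked on a built-in corpus; the output is always padded to the requested length:

(*hypothetical usage check*)
StringLength@sample[ExampleData[{"Text", "AliceInWonderland"}]]
(*1024*)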
In[33]:= getSample[KeyValuePattern[{"FullText" -> text_String,
"FirstPublishYear" -> d_DateObject}]] :=
With[{s = sample[text]},
<|"Characters" -> s, "Words" -> s, "Date" -> dateToNum@d|>
];
$samples = 100000;
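getSample relies on a dateToNum conversion, and the testing code below on its inverse numToDate; neither is shown in this section. A minimal sketch, assuming years are mapped to a small scalar regression target (the original scaling may differ):

(*ASSUMED conversions between a DateObject and a scaled scalar target*)
dateToNum[d_DateObject] := (DateValue[d, "Year"] - 1800)/100.
numToDate[x_?NumericQ] := DateObject[{Round[100. x + 1800]}]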
In[43]:= import = Flatten[Import /@ FileNames["*.wxf", $dataDir, Infinity]];
withDate = Cases[import, KeyValuePattern["FirstPublishYear" -> _DateObject]];
trainingData =
RandomSample[
Flatten@Table[getSample /@ withDate, Ceiling[$samples/Length[withDate]]],
UpTo[$samples]];
Length@trainingData
Training
results = NetTrain[
net,
trainingData,
All,
ValidationSet -> Scaled[.25],
TargetDevice -> "GPU",
MaxTrainingRounds -> Quantity[8, "Hours"],
BatchSize -> 48,
TrainingProgressCheckpointing -> {"Directory", $dataDir <>
"Trained_Networks\\", "Interval" -> Quantity[15, "Minutes"]}
];
In[66]:= trained = results["TrainedNet"];
Save trained net
In[67]:= Export["PredictTextDate.wlnet", trained]
Out[67]= "PredictTextDate.wlnet"
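The exported net can later be reloaded without retraining:

trained = Import["PredictTextDate.wlnet"];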
Testing and Results
In[25]:= CalculateAccuracy[title_String] := Module[{text, predDate, actualDate},
  text = processForInput[sample[ResourceData[title]]];
  actualDate = ResourceObject[title]["SourceMetadata"]["Date"];
  predDate = numToDate[trained[text]];
  {IntegerPart[
    Abs[UnitConvert[DateDifference[actualDate, predDate], "Years"]]],
   actualDate, DateObject[predDate, "Year"]}
  ]
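CalculateAccuracy uses a processForInput helper not shown in this section. A minimal sketch, assuming it simply packages a text sample into the net's two input ports, mirroring getSample above:

(*ASSUMED helper: feed the same sample to both input ports*)
processForInput[s_String] := <|"Characters" -> s, "Words" -> s|>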
In[50]:= titleList = {"Friends, Romans, Countrymen", "On the Origin of Species",
"Agnes Grey", "Alice in Wonderland", "The Pickwick Papers",
"The Wheels of Chance", "Pellucidar",
"The Adventures of Huckleberry Finn", "The Emerald City of Oz",
"The Old Curiosity Shop", "Adam Bede", "A Study in Scarlet",
"Micah Clarke", "Prufrock"};
In[51]:= accuracyList = CalculateAccuracy /@ titleList;
In[52]:= resultsTable =
  Dataset@Join[{{"Error", "Actual Date", "Predicted Date"}},
    SortBy[accuracyList, #[[2]] &]];
In[53]:= meanAccuracy = N@Mean@accuracyList[[All, 1]]
Out[53]= Quantity[25.8571, "Years"]
Want to test it out?
We have launched a micro-site that implements the current neural network architecture, allowing you to predict the publication date of an input text. Link: Dating Historical Texts Microsite
You can also try the code yourself using the download link below. Link: Download Code
Acknowledgements
This project could not have been accomplished without the support, encouragement, and insight of my mentor, Mr. Richard Hennigan.