
[WSC 2018] Finding the Author of a Text using Machine Learning


Abstract

Getting computers to understand non-quantitative things, like the style of an author or the emotions on a human face, is a challenge that is still being explored today. I used a character-level recurrent neural network, which had by far the best accuracy of the methods I tried. It takes in a string from the works of one of ten preselected authors and outputs which author it thinks wrote the text. Accuracy is decent for snippets of roughly four sentences, but the error rate is much higher for snippets shorter than three sentences.

Data Compilation

I used the Text category of the Wolfram Data Repository to decide which authors to start with. I began with the authors who had the most data in the repository, including William Shakespeare and Charles Dickens, and then worked through and imported books from all of the big-name authors with more than three works in the repository. I also imported a few texts from Project Gutenberg and from Stephen Wolfram's website to fill out my data.

fullContent = Block[{PrintTemporary}, <|
    "William Shakespeare" -> StringRiffle[{
       ResourceData["Shakespeare's Sonnets"],
       Import[startDir <> "RandJ.txt"],
       Import[startDir <> "ShakespeareMacbeth.txt"],
       Import[startDir <> "Tempest.txt"],
       ResourceData["Hamlet"]}, "\n\n"],
    "Mark Twain" -> StringRiffle[{
       ResourceData["The Adventures of Huckleberry Finn"],
       ResourceData["The Adventures of Tom Sawyer"],
       ResourceData["A Connecticut Yankee in King Arthur's Court"]}, "\n\n"],
    "Arthur Conan Doyle" -> StringRiffle[{
       ResourceData["Micah Clarke"],
       ResourceData["A Study in Scarlet"],
       ResourceData["The Valley of Fear"],
       ResourceData["The Lost World"],
       ResourceData["The Mystery of Cloomber"],
       ResourceData["The White Company"],
       ResourceData["The Poison Belt"]}, "\n\n"],
    "Charles Dickens" -> StringRiffle[{
       ResourceData["The Pickwick Papers"],
       ResourceData["Dombey and Son"],
       ResourceData["Little Dorrit"],
       ResourceData["The Chimes"],
       ResourceData["The Cricket on the Hearth: A Fairy Tale of Home"],
       ResourceData["The Old Curiosity Shop"],
       ResourceData["The Life and Adventures of Nicholas Nickleby"],
       ResourceData["The Haunted Man and the Ghost's Bargain: A Fancy for Christmas\[Hyphen]Time"]}, "\n\n"],
    "Joseph Conrad" -> StringRiffle[{
       ResourceData["The Arrow of Gold"],
       ResourceData["The Secret Sharer"],
       ResourceData["Nostromo"],
       ResourceData["Lord Jim"]}, "\n\n"],
    "Frances Hodgson Burnett" -> StringRiffle[{
       ResourceData["A Little Princess"],
       ResourceData["The Secret Garden"],
       ResourceData["The Head of the House of Coombe"]}, "\n\n"],
    "Jane Austen" -> StringRiffle[{
       ResourceData["Mansfield Park"],
       ResourceData["Pride and Prejudice"],
       ResourceData["Sense and Sensibility"],
       ResourceData["Northanger Abbey"],
       ResourceData["Persuasion"]}, "\n\n"],
    "Virginia Woolf" -> StringRiffle[{
       ResourceData["Jacob's Room"],
       ResourceData["The Voyage Out"],
       ResourceData["Godfrey Morgan"]}, "\n\n"],
    "Stephen Wolfram" -> StringRiffle[{
       ResourceData["Full Text of A New Kind of Science"],
       Import[startDir <> "text.txt"]}, "\n\n"],
    "H.G. Wells" -> StringRiffle[{
       ResourceData["The Wheels of Chance"],
       ResourceData["Mr. Britling Sees It Through"],
       ResourceData["The Country of the Blind"],
       ResourceData["Kipps, the Story of a Simple Soul"],
       ResourceData["Love and Mr. Lewisham"],
       ResourceData["The History of Mr. Polly"],
       ResourceData["The Invisible Man"],
       ResourceData["The Stolen Bacillus and Other Incidents"],
       ResourceData["The War of the Worlds"],
       ResourceData["The World Set Free"],
       ResourceData["The First Men in the Moon"]}, "\n\n"]
    |>];
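Since some authors contribute far more text than others (an imbalance I return to in Future Work), a quick sanity check is to count how many characters each author's entry holds. This one-liner is a sketch and was not part of the original workflow:

ReverseSort[StringLength /@ fullContent]  (* total characters of text per author, largest first *)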

Methods of Classifying

Classify Function

At first, I tried the Classify function, which selected the Markov method by default. The accuracy was worse than random chance when the classifier was given less than a quarter of a book. I tried other methods, like Gradient Boosted Trees and Logistic Regression, but their accuracy was comparable to the Markov model's.
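For reference, here is a minimal sketch of how those methods can be compared side by side, assuming trainSet and testSet are lists of snippet -> author rules (my actual calls are not reproduced in this post):

(* Sketch: train Classify with explicit Method settings on the same data *)
classifiers = Classify[trainSet, Method -> #] & /@
   {"Markov", "GradientBoostedTrees", "LogisticRegression"};

(* Accuracy of each classifier on held-out snippets *)
ClassifierMeasurements[#, testSet, "Accuracy"] & /@ classifiers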

Neural Net

With the help of my mentor, I built a character-level recurrent neural network consisting of a UnitVectorLayer, three GatedRecurrentLayers, a SequenceLastLayer, a LinearLayer, and a SoftmaxLayer. The idea was for the network to see each character in context and pick up patterns characteristic of a particular author's style.

net = NetInitialize@NetChain[{
    UnitVectorLayer[],         (* one-hot encode each character index *)
    GatedRecurrentLayer[128],  (* three stacked GRU layers read the character sequence *)
    GatedRecurrentLayer[128],
    GatedRecurrentLayer[128],
    SequenceLastLayer[],       (* keep only the state after the last character *)
    LinearLayer[],             (* map to one score per author *)
    SoftmaxLayer[]             (* convert scores to class probabilities *)
    },
   "Input" -> NetEncoder[{"Characters", characters}],
   "Output" -> NetDecoder[{"Class", classes}]
   ]
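The NetChain above refers to two symbols, characters and classes, whose definitions are not shown in this post. One plausible way to define them from fullContent is:

(* Assumed definitions (not from the original post): the class labels are the author
   names, and the character alphabet is every distinct character in the corpus. *)
classes = Keys[fullContent];
characters = Union[Flatten[Characters /@ Values[fullContent]]];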

Training the Neural Net

I used a data set of 10,000 random 256-character samples, generated with the following functions:

getSample[text_, blockSize_ : 256] :=
  Module[{length, offset},
   length = StringLength[text];
   offset = RandomInteger[{1, length - blockSize}];
   StringTake[text, {offset, offset + blockSize - 1}]
   ];

getTrainingData[count_] :=
  RandomSample[
   Flatten@Table[KeyValueMap[getSample[#2] -> #1 &, fullContent],
     Ceiling[count/6]], count];

My validation error was slightly higher than my training error, which can be a sign of overfitting, but testing on non-training data showed that not much overfitting was actually happening. My final training error was about 10% and my final validation error was about 15%. With fewer authors and more data per author, both the error and the validation error dropped below 5%.

results = NetTrain[net, trainSet, BatchSize -> 64,
  ValidationSet -> testSet, MaxTrainingRounds -> 50,
  TargetDevice -> "CPU"]
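The exact train/validation split and how the trained network is queried are not spelled out above; a minimal sketch, assuming an 80/20 split of the generated samples, would be:

(* Sketch (assumed, not from the original post): build an 80/20 split from the
   sampler above, then query the trained network on a new snippet. With this form
   of NetTrain, results is the trained network itself. *)
allData = getTrainingData[10000];
{trainSet, testSet} = TakeDrop[allData, 8000];

(* after training, applying the net to a string returns the predicted author *)
results["It was the best of times, it was the worst of times."]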

Future Work

This project has a lot of room to grow. In the future, I would like to add more authors and more literature for each author. I would also like to create a balanced data set that does not have an imbalance of text between some authors and others. While I used a character-level neural network for this project, I would like to build a word-level neural net, or a hybrid character- and word-level net, to see how it compares.

Acknowledgements

I would like to thank my mentor Douglas Smith, as well as Michael Kaminsky and Richard Hennigan for all the guidance and help they have given me while I worked on this project.

Microsite

Try out my neural net on the microsite: Predicting the Author of a Text.

POSTED BY: Nihar Shah