[WSS18] Punctuation Restoration With Recurrent Neural Networks

POSTED BY: Mengyi Shan

Question. I have a bunch of YouTube videos (on Wolfram Language and its use in legal studies). One can obtain a sort-of transcript of each of the videos from within YouTube and download those transcripts. But the transcripts contain no punctuation. They have, for example, no periods. So, they are hard to read. Is there a trained version of your network in WLNet form that one could import and then set to work punctuating my transcripts? I am aware that the result will not be great. But it is likely to be A LOT better than the gob of unpunctuated text I currently possess. And it will (I hope) be easy to use. If there is a WLNet file on GitHub or somewhere, could you please let me know how I can obtain it?
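If such a file existed, I imagine using it would look something along these lines (a sketch only; the file name "punctuate.wlnet", the tag names, and the tag-to-punctuation mapping below are just my guesses):

net = Import["punctuate.wlnet"];  (* hypothetical file name *)
words = StringSplit[transcriptText];  (* transcriptText: one unpunctuated YouTube transcript *)
tags = net[words];  (* one class tag per word *)
punct = <|"a" -> "", "b" -> ",", "c" -> "."|>;  (* guessed mapping from tags to punctuation *)
restored = StringRiffle[MapThread[#1 <> punct[#2] &, {words, tags}], " "]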

POSTED BY: Seth Chandler
Posted 7 years ago

Thank you very much for this posting. I have been waiting for over a decade for this kind of MMA application. What I wish for is a PDF book that teaches how to use MMA on texts. Rock on.

POSTED BY: Andrew Meit

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

POSTED BY: EDITORIAL BOARD

Dear Mengyi,

I have just completed training your network on a data set that is about 10 times larger than yours (and possibly of better overall quality, because I no longer used Wikipedia).

This is the network I used:

net4 = NetChain[{
    embeddingLayer,  (* word-embedding layer, as defined in the original post *)
    ElementwiseLayer[Tanh],
    LongShortTermMemoryLayer[100],
    NetBidirectionalOperator[{LongShortTermMemoryLayer[40], 
      GatedRecurrentLayer[40]}],
    NetBidirectionalOperator[{GatedRecurrentLayer[30], 
      LongShortTermMemoryLayer[30]}],
    LongShortTermMemoryLayer[30],
    NetBidirectionalOperator[{LongShortTermMemoryLayer[10], 
      GatedRecurrentLayer[10]}],
    LongShortTermMemoryLayer[10],
    NetMapOperator[LinearLayer[3]],  (* per-word scores for the three classes *)
    SoftmaxLayer["Input" -> {"Varying", 3}]}, 
   "Output" -> NetDecoder[{"Class", {"a", "b", "c"}}]  (* one tag per word *)
   ];

And here are the results:

[Image: results of the retrained network]

It turns out that all values improved somewhat; the comma scores gained the most.

My actual corpus is at least 5-10 times larger than what I have been using so far, but I will need another GPU setup and/or other tricks to be able to run the training on the entire corpus (10,000 Project Gutenberg books plus other texts).

I am happy to make the trained network (163 MB) available upon request. If you are outside Germany, you can also easily download the Gutenberg dump.
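If you just want a small public-domain text for a quick test without downloading the full dump, the built-in example texts work too (an arbitrary choice on my part):

sampleText = ExampleData[{"Text", "AliceInWonderland"}];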

Cheers, Marco

POSTED BY: Marco Thiel

Hi Marco,

The problem is that if, for example, you have a piece of text like "word1 ,,, word2", the tagging function will recognize it as three words, "word1", ",,,", and "word2", and assign them three tags, "c", "b", "c". But the embedding layer will only recognize the two words "word1" and "word2". That is why there are only 199 words.
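You can see the token mismatch directly:

StringSplit["word1 ,,, word2"]
(* {"word1", ",,,", "word2"} -- three tokens, but only two actual words *)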

One possible solution is to check whether your original text contains patterns like ",,", "..", " .", or " ,". I'll also revise my toPureText function to try to avoid this situation.
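As a rough sketch of that check (not the final fix that will go into toPureText), something like this would collapse repeated punctuation and remove stray spaces before it:

cleanText[s_String] := StringReplace[
  StringReplace[s, {"," .. -> ",", "." .. -> "."}],  (* collapse runs of commas/periods *)
  Whitespace ~~ p : ("," | ".") :> p]  (* drop whitespace before punctuation *)

cleanText["word1 ,,, word2"]
(* "word1, word2" *)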

Thanks again for helping me find this bug!! :)

Mengyi

POSTED BY: Mengyi Shan

This is weird. As far as I can tell, my data looks just like yours. But it must be something about the data: if I take the net and just NetInitialize it, everything seems to work fine, so it must be about the training.

Now I am getting somewhere. It turns out that if I delete the validation set and only use the first 358 elements of the training set, it runs. Element 359 is the offending one: in that entry the string has 199 words, but there are 200 labels in the classification list.
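One quick way to locate such entries (a sketch; it assumes each element pairs the string with its tag list, accessible via Part as below):

badPositions = Flatten[Position[
   Map[Length[StringSplit[#[[1]]]] == Length[#[[2]]] &, trainingSet], False]]
(* positions of elements whose word count and tag count disagree *)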

If I clean this up by

trainingSetclean = Select[trainingSet, (Length[StringSplit[#[[1]]]] == 200 && Length[#[[2]]] == 200) &];

and

validationSetclean = Select[validationSet, (Length[StringSplit[#[[1]]]] == 200 && Length[#[[2]]] == 200) &];

it appears to work.

Thank you for helping to solve this.

I'll report back if I find anything interesting.

Cheers,

Marco

POSTED BY: Marco Thiel

This suggests a problem with the pre-processing step, that is, the step that deletes other characters and makes the text "clean". Since the toPureText function was written based on my own data set, using different training data might expose bugs.

Here is my example with a mini data set of one paragraph. If it still doesn't work for you, you can send me your corpus and I'll check what's happening :)

[Image: test result with the mini data set]
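Schematically, each training element pairs a string of words with one tag per word, e.g. (shown here as a rule, with made-up words and tags, and much shorter than the real 200-word elements):

"the cat sat on the mat" -> {"c", "c", "c", "c", "c", "b"}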

POSTED BY: Mengyi Shan

Hi,

I must be overlooking something here. I think I have reproduced your code (with trivial modifications), but I always get warnings when I execute

NetTrain[net4, trainingSet, All, ValidationSet -> validationSet, TargetDevice -> "GPU"]

The warning is:

[Image: NetTrain warning message]

Could you post one training-set data element? It is late here, so I am probably missing something trivial. The same message occurs with all of the networks you suggest.

Cheers,

Marco

POSTED BY: Marco Thiel

I get a dramatic speed improvement when I train on the GPU. On typical networks the training time goes down from, say, 9 hours to minutes. That will open up many new possibilities for modifying networks and data sets.

I am just running your code (with some minor modifications and additions; e.g. rawText does not seem to be defined), but downloading the Wikipedia articles will take 60+ minutes. I hope that the GPU will make up for that.
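For reference, I am defining the missing rawText along these lines on my side (a placeholder only; the article choice is arbitrary):

rawText = StringJoin[WikipediaData /@ {"Isaac Newton", "Alan Turing", "Ada Lovelace"}];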

I also have access to some relatively large text corpora (including, but not limited to, a 10,000-book dump that Project Gutenberg offers/offered), so I might use those for training later on, after trying to reproduce your results.

Thanks again for posting this great article,

Marco

POSTED BY: Marco Thiel

Hello Marco,

Thanks for your reply! I ran the training on my laptop for a whole night, approximately 9 hours. I'm now working on a data set that is 10 times larger and plan to run it on a GPU. I believe this will produce a great improvement.

Mengyi

POSTED BY: Mengyi Shan

This is really useful. I have several programs to transcribe texts within the Wolfram Language/Mathematica, but text segmentation/punctuation is always a problem. It appears that getting larger datasets should not be a problem in this case.

It appears that your NetTrain call does not use GPU acceleration. How long did the training take, and on what type of machine?
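For my own attempts I plan to switch on GPU training simply by adding the TargetDevice option (assuming a supported NVIDIA card):

NetTrain[net4, trainingSet, All, ValidationSet -> validationSet, TargetDevice -> "GPU"]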

Best wishes,

Marco

POSTED BY: Marco Thiel