
Large corpus of parsed sentences in English

Posted 8 years ago

Hello, for a project in visual scene understanding I'm looking for a large corpus of parsed sentences from text (books, etc.), from which I could extract Subject-Verb-Object (SVO) statistics. I would like to know, for example, the frequencies of SVO triplets such as boy-play-girl in 'The boy in white is playing with the nice girl'. Thank you.

POSTED BY: Yair Lakretz
4 Replies
Posted 8 years ago

Marco, Thanks a lot for the detailed answer. This is really helpful!

POSTED BY: Yair Lakretz

Hi,

I am certainly not an expert on this, but here are some thoughts.

I use the Open American National Corpus (OANC). It is only 15 million words, but it is free. When I unzip the file on the desktop, I get a folder called OANC-GrAF. There are lots of annotations, but I am only interested in the txt-files:

fileNames = FileNames["*.txt", "~/Desktop/OANC-GrAF/", Infinity];

Altogether there are

Length[fileNames]

8824 txt-files. We can import and analyse all sentences. Here I only use the first three txt-files and only the first 5 sentences, to check whether it works:

Column[Framed /@ (TextStructure /@ Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])]

enter image description here
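The same import-and-split expression recurs in the examples that follow, so it may be convenient to evaluate it once and reuse the result. A minimal sketch, assuming fileNames is defined as above; the name sentences is illustrative and does not appear in the original post:

```wolfram
(* Assumption: fileNames holds the list of txt-file paths defined earlier. *)
(* "sentences" is an illustrative name, not part of the original post. *)
sentences = Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]];

(* Equivalent to the one-liner above, but reusing the stored sentences *)
Column[Framed /@ (TextStructure /@ sentences)]
```

Storing the sentences avoids re-importing the files every time a different TextStructure or TextCases query is tried on the same sample.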

I can extract lots of information such as:

(TextStructure[#, "PartOfSpeech"] & /@ Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])

enter image description here

or like this

(Normal[TextStructure[#, "PartOfSpeech"]] /. TextElement -> List & /@ Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])

enter image description here

I can also look for nouns and verbs in the sentences:

({DeleteDuplicates[TextCases[#, "Noun" | "ProperNoun" | "Pronoun"]], DeleteDuplicates[TextCases[#, "Verb"]]} & /@ 
  Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])

enter image description here

We can now make a graph of this by drawing edges between nouns and verbs in the same sentence:

Graph[Flatten[
  Outer[Rule, #[[1]], #[[2]]] & /@ ({DeleteDuplicates[
        TextCases[#, "Noun" | "ProperNoun" | "Pronoun"]], 
       DeleteDuplicates[TextCases[#, "Verb"]]} & /@ 
     Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 1]]])], 
 VertexLabels -> "Name", VertexLabelStyle -> Directive[Red, 14], 
 EdgeStyle -> Directive[Arrowheads[{{0.01, 0.6}}], Opacity[0.2]], 
 VertexSize -> Medium, GraphLayout -> "BalloonEmbedding", 
 ImageSize -> Full]

enter image description here

Here I use only the first txt-file, in its entirety. We can also use different types of embedding, like so:

Graph[Flatten[
  Outer[Rule, #[[1]], #[[2]]] & /@ ({DeleteDuplicates[
        TextCases[#, "Noun" | "ProperNoun" | "Pronoun"]], 
       DeleteDuplicates[TextCases[#, "Verb"]]} & /@ 
     Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 1]]])], 
 VertexLabels -> "Name", VertexLabelStyle -> Directive[Red, 14], 
 VertexSize -> Medium, 
 EdgeStyle -> Directive[Arrowheads[{{0.01, 0.6}}], Opacity[0.2]], 
 GraphLayout -> {VertexLayout -> {"MultipartiteEmbedding"}}, 
 ImageSize -> Full]

enter image description here

With a bit of patience it is possible to analyse the entire corpus. For example:

WordCloud[DeleteStopwords[Flatten[TextWords[Import[#]] & /@ fileNames]], IgnoreCase -> True]

gives

enter image description here

For the analysis that you are interested in, the function TextStructure with "DependencyString" as its second argument might be useful. For example:

Cases[(TextStructure[#, "DependencyString"] & /@ Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])[[1]], {"nsubj", {_, _}}, Infinity]

gives

{{"nsubj", {"this", 4}}, {"nsubj", {"you", 11}}}

which makes sense given that the first sentence is:

Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]]

enter image description here

This is of course only showing the principle. It is not a solid analysis, but I hope it helps.
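Once triplets have been extracted by whatever means, the frequency count asked for in the question is a short step. A rough sketch, assuming the triplets are already available as a list of {subject, verb, object} lists; the data below is made up purely for illustration and does not come from the corpus:

```wolfram
(* Hypothetical toy data; real triplets would come from the dependency parse. *)
triplets = {
   {"boy", "play", "girl"},
   {"dog", "chase", "cat"},
   {"boy", "play", "girl"}
   };

(* Tally counts identical triplets; sort so the most frequent comes first *)
Reverse[SortBy[Tally[triplets], Last]]
(* {{{"boy", "play", "girl"}, 2}, {{"dog", "chase", "cat"}, 1}} *)
```

How the triplets themselves are built from the dependency output (for example by pairing each "nsubj" entry with its verb and object) is left open here, since it depends on the exact structure that TextStructure returns.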

Cheers,

M.

POSTED BY: Marco Thiel
Posted 8 years ago

Thank you Arnoud for the quick reply. These functions seem useful indeed.

POSTED BY: Yair Lakretz

These are probably useful functions for you:

http://reference.wolfram.com/language/ref/TextStructure.html

TextStructure["The boy in white is playing with the nice girl"]

http://reference.wolfram.com/language/ref/ExampleData.html

ExampleData[{"Text","DeclarationOfIndependence"}]

http://reference.wolfram.com/language/ref/TextSentences.html

TextSentences["This is a sentence.  This is another sentence."]
POSTED BY: Arnoud Buzing