Large corpus of parsed sentences in English

Yair Lakretz

Posted 11 years ago

Hello, for a project in visual scene understanding I'm looking for a large corpus of parsed sentences from text (books, etc.), from which I could extract Subject-Verb-Object (SVO) statistics. For example, I would like to know the frequencies of SVO triplets such as boy-play-girl in 'The boy in white is playing with the nice girl'. Thank you.
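
To make the target statistic concrete, here is a minimal sketch of counting SVO triplet frequencies in the Wolfram Language. The list `triplets` is made-up illustration data, not parser output, and the names `tripletCounts` and `tripletFrequencies` are only illustrative; obtaining such triplets from real text is the open part of the question.

```wolfram
(* Hypothetical, hand-written SVO triplets; a parser would have to supply these *)
triplets = {
   {"boy", "play", "girl"},
   {"dog", "chase", "cat"},
   {"boy", "play", "girl"},
   {"girl", "read", "book"}};

(* Counts gives an association from each distinct triplet to its frequency *)
tripletCounts = Counts[triplets];

(* Relative frequencies, if proportions are preferred over raw counts *)
tripletFrequencies = N[tripletCounts/Length[triplets]];
```

Since the keys are lists, wrapping them in Key when looking them up, as in `tripletCounts[Key[{"boy", "play", "girl"}]]`, avoids any ambiguity with part specifications.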

POSTED BY: Yair Lakretz

4 Replies

Yair Lakretz

Posted 11 years ago

Marco, Thanks a lot for the detailed answer. This is really helpful!

POSTED BY: Yair Lakretz

Marco Thiel

Marco Thiel, University of Aberdeen - Dept. of Physics/Mathematics

Posted 11 years ago

Hi,

I am certainly not an expert on this, but here are some thoughts.

I use this corpus (the Open American National Corpus, OANC). It is only 15 million words, but it is free. When I unzip the file on the desktop I get a folder called OANC-GrAF. There are lots of annotations, but I am only interested in the txt-files:

fileNames = FileNames["*.txt", "~/Desktop/OANC-GrAF/", Infinity];

Altogether there are

Length[fileNames]

8824 txt-files. We can import and analyse all sentences. Here I only use the first three txt-files and only the first 5 sentences, to check whether it works:

Column[Framed /@ (TextStructure /@ Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])]
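
The sentence-selection expression is repeated in every example that follows, so it can help to name it once. A minimal sketch: `sampleSentences` is a hypothetical helper, not a built-in, and it takes already-imported text strings so that it can be tried without the corpus on disk.

```wolfram
(* Hypothetical helper: split each text into sentences, pool them, keep at most the first n *)
sampleSentences[texts : {___String}, n_Integer?Positive] :=
  Take[Flatten[TextSentences /@ texts], UpTo[n]];

(* Stand-in strings; with the corpus one would pass Import /@ fileNames[[1 ;; 3]] *)
demoTexts = {"This is a sentence. This is another sentence.", "A third one."};
demoSentences = sampleSentences[demoTexts, 2];
```

With the corpus in place, `TextStructure /@ sampleSentences[Import /@ fileNames[[1 ;; 3]], 5]` should select the same five sentences as the expression above.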

I can extract lots of information such as:

(TextStructure[#, "PartOfSpeech"] & /@ Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])

or like this

(Normal[TextStructure[#, "PartOfSpeech"]] /. TextElement -> List & /@ Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])

I can also look for nouns and verbs in the sentences:

({DeleteDuplicates[TextCases[#, "Noun" | "ProperNoun" | "Pronoun"]], DeleteDuplicates[TextCases[#, "Verb"]]} & /@ 
  Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])

We can now make a graph of this by drawing edges between nouns and verbs in the same sentence:

Graph[Flatten[
  Outer[Rule, #[[1]], #[[2]]] & /@ ({DeleteDuplicates[
        TextCases[#, "Noun" | "ProperNoun" | "Pronoun"]], 
       DeleteDuplicates[TextCases[#, "Verb"]]} & /@ 
     Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 1]]])], 
 VertexLabels -> "Name", VertexLabelStyle -> Directive[Red, 14], 
 EdgeStyle -> Directive[Arrowheads[{{0.01, 0.6}}], Opacity[0.2]], 
 VertexSize -> Medium, GraphLayout -> "BalloonEmbedding", 
 ImageSize -> Full]

where I only use the entire first txt-file. We can also use different types of embedding like so:

Graph[Flatten[
  Outer[Rule, #[[1]], #[[2]]] & /@ ({DeleteDuplicates[
        TextCases[#, "Noun" | "ProperNoun" | "Pronoun"]], 
       DeleteDuplicates[TextCases[#, "Verb"]]} & /@ 
     Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 1]]])], 
 VertexLabels -> "Name", VertexLabelStyle -> Directive[Red, 14], 
 VertexSize -> Medium, 
 EdgeStyle -> Directive[Arrowheads[{{0.01, 0.6}}], Opacity[0.2]], 
 GraphLayout -> {VertexLayout -> {"MultipartiteEmbedding"}}, 
 ImageSize -> Full]
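
Because DeleteDuplicates is applied per sentence, a noun-verb pair that occurs in several sentences still produces several identical edges, and those can be counted to get co-occurrence weights. A minimal sketch with made-up data in place of the TextCases output; `perSentence`, `edges` and `edgeCounts` are illustrative names.

```wolfram
(* Hypothetical {nouns, verbs} pairs, one per sentence, standing in for the TextCases results *)
perSentence = {
   {{"boy", "girl"}, {"playing"}},
   {{"boy"}, {"playing", "is"}}};

(* Same edge construction as in the Graph examples: every noun linked to every verb *)
edges = Flatten[Outer[Rule, #[[1]], #[[2]]] & /@ perSentence];

(* Count identical edges and list the most frequent pair first *)
edgeCounts = Reverse[Sort[Counts[edges]]];
```

These are co-occurrence counts within a sentence, not grammatical subject-verb relations, so they only approximate the statistic asked for.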

With a bit of patience it is possible to analyse the entire corpus. For example:

WordCloud[DeleteStopwords[Flatten[TextWords[Import[#]] & /@ fileNames]], IgnoreCase -> True]

This produces a word cloud of the corpus vocabulary with stopwords removed.
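
For a corpus of this size it may be easier to count words one file at a time and merge the per-file counts than to hold every word in a single list. A minimal sketch; `wordCounts` is a hypothetical helper and the two strings stand in for imported files.

```wolfram
(* Hypothetical helper: lower-cased word counts for a single text *)
wordCounts[text_String] := Counts[ToLowerCase[TextWords[text]]];

(* Stand-in texts; with the corpus one would use Import /@ fileNames instead *)
stubTexts = {"The boy plays. The girl plays.", "The boy reads."};

(* Merge adds up the counts of words that occur in several texts *)
totalCounts = Merge[wordCounts /@ stubTexts, Total];

(* Drop stopwords by keeping only the keys that survive DeleteStopwords *)
contentCounts = KeyTake[totalCounts, DeleteStopwords[Keys[totalCounts]]];
```

An association of weights like `contentCounts` can be passed to WordCloud directly.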

For the analysis that you are interested in, the function TextStructure in combination with the "DependencyString" form might be useful. For example:

Cases[(TextStructure[#, "DependencyString"] & /@ Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]])[[1]], {"nsubj", {_, _}}, Infinity]

gives

{{"nsubj", {"this", 4}}, {"nsubj", {"you", 11}}}

which makes sense given the first sentence in the following list:

Flatten[TextSentences[Import[#]] & /@ fileNames[[1 ;; 3]]][[1 ;; 5]]
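
Once the nsubj entries are found, the subject words can be pulled out and counted with a replacement rule inside Cases. A minimal sketch that uses the result quoted above as literal input, so it does not depend on the corpus; `nsubjEntries`, `subjects` and `subjectCounts` are illustrative names.

```wolfram
(* The result quoted above, entered by hand *)
nsubjEntries = {{"nsubj", {"this", 4}}, {"nsubj", {"you", 11}}};

(* Keep only the word from each {"nsubj", {word, position}} entry *)
subjects = Cases[nsubjEntries, {"nsubj", {word_String, _Integer}} :> word, Infinity];

(* Tally subjects case-insensitively; over many sentences this gives subject frequencies *)
subjectCounts = Counts[ToLowerCase[subjects]];
```

The same pattern with a different relation label would collect other dependency types, although pairing each subject with its verb and object needs more of the dependency structure than this sketch uses.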

This is of course only showing the principle. It is not a solid analysis, but I hope it helps.

Cheers, M.

POSTED BY: Marco Thiel

Yair Lakretz

Posted 11 years ago

Thank you Arnoud for the quick reply. These functions seem useful indeed.

POSTED BY: Yair Lakretz

Arnoud Buzing

Arnoud Buzing, Wolfram Research

Posted 11 years ago

These are probably useful functions for you:

http://reference.wolfram.com/language/ref/TextStructure.html

TextStructure["The boy in white is playing with the nice girl"]

http://reference.wolfram.com/language/ref/ExampleData.html

ExampleData[{"Text","DeclarationOfIndependence"}]

http://reference.wolfram.com/language/ref/TextSentences.html

TextSentences["This is a sentence. This is another sentence."]

POSTED BY: Arnoud Buzing
