Comparing translations of "Heart of a Dog."

Posted 7 years ago

Hi Wolfram Community,

This post branches off Vitaliy's earlier thread about textual comparison. I found two online PDF translations of "Heart of a Dog" and processed the text as follows:

PDF Source 1 $\longrightarrow$ plaintext cipher

PDF Source 2 $\longrightarrow$ plaintext cipher

It's easy to get going right away with word-by-word analysis, and I produce the following distribution of words, showing fluctuations between the texts:

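(* TextStrings holds the two plaintext strings, as in the ciphers hyperlinked above *)
ParsedTexts = StringSplit[#," "]&/@TextStrings
Length /@ ParsedTexts (* total number of words in each text *)
Length[Union[#]] & /@ ParsedTexts (* number of distinct words in each text *)
Out[1] = {32399, 34772}
Out[2] = {5060, 5197}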
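Splitting on a single space leaves punctuation, capitalization, and line breaks attached to the tokens, so "dog", "Dog," and "dog!" are counted as different words. As a rough sketch of a stricter tokenization (the name `normalizeWords` and the sample sentence are made up for illustration and are not part of the processing used in this post), one could lowercase the text and let `TextWords` drop the punctuation:

```wolfram
(* hypothetical helper: lowercase the text, then split it into words without punctuation *)
normalizeWords[text_String] := TextWords[ToLowerCase[text]]

(* made-up sample sentence, not taken from either translation *)
sample = "The dog howled. \"Dog,\" he said, \"poor dog!\"";

(* naive split on spaces: punctuation and case keep the three dogs distinct *)
Tally[StringSplit[sample, " "]]

(* normalized split: the three dogs should collapse into one entry *)
Tally[normalizeWords[sample]]
```

Using `normalizeWords /@ TextStrings` in place of the space split would change the word counts reported here, so the numbers in this post apply to the space-split version only.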

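Tallies = Tally /@ ParsedTexts; (* {word, count} pairs for each text *)
Words = Union @@ ParsedTexts; (* combined vocabulary of both texts *)
(* for every word, look up its count in each tally; a word missing from a text gives {} -> 0 *)
AbsoluteTiming[
 CompareCounts = 
   ReplaceAll[
    Prepend[(Function[{a}, Cases[Tallies[[a]], {#, b_} :> b]] /@ {1, 
          2}), #] & /@ Words, {{} -> 0, {x_Integer} :> x}]; ]
(* {word, count1, count2} -> {word, count1 - count2} *)
CompareDiff = CompareCounts /. {x_, y_, z_} :> {x, y - z};
Histogram[CompareDiff[[All, 2]], PlotRange -> {{-10, 10}, {0, 3000}}]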

Diff Distribution

The histogram shows that most of the count differences cluster around $0$, and each text contains about 2000 words not found in the other. This is by no means a perfect analysis and probably contains some processing errors, so it should be taken lightly.
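Both figures in this paragraph can be cross-checked with associations, which avoids scanning the two tallies once for every word. This is only a sketch: `countDiff`, `uniqueWords`, and the toy lists are hypothetical names introduced here, and the real input would be the two lists in `ParsedTexts`.

```wolfram
(* hypothetical helper: per-word count difference between two token lists *)
countDiff[a_List, b_List] :=
 Module[{ca = Counts[a], cb = Counts[b], words = Union[a, b]},
  (* Lookup returns the default 0 for words missing from one text *)
  AssociationThread[words, Lookup[ca, words, 0] - Lookup[cb, words, 0]]]

(* hypothetical helper: distinct words that occur in one list but not the other *)
uniqueWords[a_List, b_List] :=
 <|"onlyFirst" -> Complement[a, b], "onlySecond" -> Complement[b, a]|>

(* toy data standing in for the two translations *)
t1 = {"the", "dog", "barked", "the", "professor"};
t2 = {"the", "dog", "howled", "professor", "professor"};

countDiff[t1, t2]
Length /@ uniqueWords[t1, t2]
```

On the real data, `Histogram[Values[countDiff @@ ParsedTexts], PlotRange -> {{-10, 10}, {0, 3000}}]` should reproduce the plot, and `Length /@ (uniqueWords @@ ParsedTexts)` gives the number of words exclusive to each translation.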

Now we can also import Vitaliy's algorithm and find some interesting results:

ideaNET[text_String, order_] := 
 Module[{wordsTOP, edges, resctal, 
   words = TextWords[DeleteStopwords[ToLowerCase[text]]]}, 
  resctal = 
   Transpose[MapAt[N[Rescale[#]] &, Transpose[Tally[words]], 2]];
  wordsTOP = Select[resctal, Last[#] >= order &];
  edges = 
   UndirectedEdge @@@ 
    DeleteDuplicates[
     Sort /@ DeleteCases[
       Partition[Cases[words, Alternatives @@ wordsTOP[[All, 1]]], 2, 
        1], {x_String, x_String}]];
  CommunityGraphPlot[
   Graph[edges, 
    VertexSize -> 
     Thread[wordsTOP[[All, 1]] -> .1 + .9 wordsTOP[[All, 2]]], 
    VertexLabels -> Automatic, 
    VertexLabelStyle -> Directive[20, White, Opacity[.8]], 
    GraphStyle -> "Prototype", Background -> Black], 
   CommunityBoundaryStyle -> Directive[GrayLevel[.4], Dashed], 
   CommunityRegionStyle -> GrayLevel[.2], ImageSize -> 500 {1, 1}, 
   PlotRangePadding -> {{.1, .3}, {0.1, 0.1}}]]

{ideaNET[StringJoin[StringRiffle[#, " "]], .15] & /@ ParsedTexts,
  ideaNET[StringJoin[StringRiffle[#, " "]], .1] & /@ 
   ParsedTexts} // TableForm

[Figure: community graphs of the two translations at cut values .15 and .1]

So we see that varying the cut parameter causes a bifurcation in the graphs. What is the meaning of this bifurcation? It's especially intriguing when you consider that one translation's graph has Zina and Darya in the same community, while the other has Zina and Darya in separate communities. I don't know much about the function CommunityGraphPlot, so maybe one of the experts here would care to comment on the meaning of this discrepancy? In any case, it's nice to see this technology put to some positive use, as I would not be interested in the least in seeing a graph of some social networking site.
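One mechanical part of the answer follows from the code itself: lowering the cut from .15 to .1 lets more words pass the `Select` step, so the graph gains vertices and edges, and any partition computed from it can shift. To inspect the grouping directly instead of reading it off the plot, a sketch along the following lines might help. The helpers `ideaGraph` and `sameCommunityQ` are hypothetical, and the sketch assumes that `FindGraphCommunities` with default settings yields the same partition that `CommunityGraphPlot` draws, which I have not verified.

```wolfram
(* hypothetical helper: same edge construction as ideaNET, but returns the Graph itself *)
ideaGraph[text_String, order_] :=
 Module[{words, resctal, wordsTOP},
  words = TextWords[DeleteStopwords[ToLowerCase[text]]];
  (* rescale word counts to the interval [0, 1] *)
  resctal = Transpose[MapAt[N[Rescale[#]] &, Transpose[Tally[words]], 2]];
  (* keep only words whose rescaled frequency reaches the cut *)
  wordsTOP = Select[resctal, Last[#] >= order &];
  (* connect kept words that follow each other once all other words are dropped *)
  Graph[UndirectedEdge @@@ DeleteDuplicates[
     Sort /@ DeleteCases[
       Partition[Cases[words, Alternatives @@ wordsTOP[[All, 1]]], 2, 1],
       {x_String, x_String}]]]]

(* hypothetical helper: True if some detected community contains both vertices *)
sameCommunityQ[g_Graph, a_String, b_String] :=
 AnyTrue[FindGraphCommunities[g], SubsetQ[#, {a, b}] &]
```

With the data above, `sameCommunityQ[ideaGraph[StringRiffle[#, " "], .15], "zina", "darya"] & /@ ParsedTexts` would return one truth value per translation. The names are lowercase because the text is lowercased first, and the result is `False` whenever either name falls below the cut and never enters the graph.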

What is the next step in our Magnificent Integral? We could possibly find a way to rip the subtitles out of the following video adaptation from 1988, during the Dissolution of the Soviet Union: Собачье сердце ("Heart of a Dog", 1988)!

To conclude, let's do one last comparison of the cipher texts:
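One way such a comparison might be carried out is to count occurrences of a name in each plaintext string. This is a sketch, not necessarily the calculation used here: `nameCounts` and the toy strings are hypothetical, and the real input would be `TextStrings`.

```wolfram
(* hypothetical helper: occurrences of a name in each text, ignoring case *)
nameCounts[texts : {__String}, name_String] :=
 StringCount[#, name, IgnoreCase -> True] & /@ texts

(* toy strings standing in for the two plaintext translations *)
toy = {"Vasnetsova signed the paper. Vasnetsova left.", "The typist left."};

nameCounts[toy, "Vasnetsova"]
```

Because this matches substrings, forms such as "Vasnetsova's" are counted as well. On the real data the call would be `nameCounts[TextStrings, "Vasnetsova"]`.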

According to my calculations, the name "Vasnetsova" appears twice in one of the translated texts and zero times in the other. It reminds me of a painting I've seen inside St. Vladimir's Cathedral in Kiev, The Russian Bishops. Someday it would be nice to visit the Bulgakov Museum and to tour the cathedral, but I'm afraid it's not the year to do so.

Bradley Klee

POSTED BY: Brad Klee