Message Boards Message Boards

Visualizing data of Leonardo DiCaprio movies

In this post I show how to create this BubbleChart representing the average words/minute and rating of the movie (the bigger the diameter of the bubble is, the better is the movie) and also show a simple way of building story flow visualization with SmoothHistogram. For the story flow you can take a look at this short tutorial Story Flow in Alice or even try coding it in a few lines in open cloud.

enter image description here

enter image description here

Finally last Sunday Leonardo DiCaprio won his first Oscar with "The Revenant". He did many other great movies before, but the golden statuette always escaped from his hands. Here I will try to visualize his career trajectory by comparing his movies.

In order to compare his movies I will use three variables for each movie: the year of its release, the average words/minute and its rating (taken from IMDb).

The first thing that we need to do is to import the script of each movie (you can download the english subtitles in many sites). Let's import the ones for "Inception":

SetDirectory[NotebookDirectory[]];    
srt = Import["inception-yify-english.srt", "Text"];

The format of the script is the following:

"21
00:03:15,486 --> 00:03:18,447
Yes. In the dream state,
your conscious defenses are lowered...

22
00:03:18,615 --> 00:03:21,033
...and that makes your thoughts
vulnerable to theft."

So, we need to do some text processing in order to get a string with all words only:

partitions = StringSplit[srt, "\n"];
blocks = Flatten[Position[StringSplit[srt, "\n"], ""]];
script = Flatten[Table[Table[i, {i, blocks[[j]] + 3, blocks[[j + 1]] - 1}], {j, Length[blocks] - 1}]];
stringsrt =  StringDelete[StringJoin[Flatten[Map[partitions[[#]] &, script]]], {Shortest[  "[" ~~ x___ ~~ "]"]}];

Now, we can do for example a WordCloud of the movie:

WordCloud[DeleteStopwords[stringsrt], ScalingFunctions -> (#^.1 &)]

enter image description here

But what we actually want is the StringLength of the script, so we can estimate the average words/minute later on:

StringLength[stringsrt]
68719

Repeating the same for the rest of the movies, we can finally create the BubbleChart that appears on top of the page:

bubbleChart = Framed[Labeled[BubbleChart[<|"This Boy's Life 7.3" -> {1993, 45901/(4.79 115), 7.3},
     "What's Eating Gilbert Grape 7.8" -> {1993, 45000/(4.79 118), 7.8},
     "The Basketball Diaries 7.3" -> {1995, 45996/(4.79 102), 7.3},
     "Total Eclipse 6.5" -> {1995, 29078/(4.79 111), 6.5},
     "Romeo+Juliet 6.8" -> {1996, 41022/(4.79 120), 6.8},
     "Titanic 7.7" -> {1997, 64681/(4.79 194), 7.7},
     "The Man in the Iron Mask 6.4" -> {1998, 34450/(4.79 132), 6.4},
     "The Beach 6.6" -> {2000, 42965/(4.79 119), 6.6},
     "Gangs of New York 7.5" -> {2002, 53011/(4.79 167), 7.5},
     "Catch Me If You Can 8.0" -> {2002, 62683/(4.79 141), 8.0},
     "The Aviator 7.5" -> {2004, 83739/(4.79 170), 7.5},
     "The Departed 8.5" -> {2006, 74151/(4.79 151), 8.5},
     "Blood Diamond 8.0" -> {2006, 49529/(4.79 143), 8.0},
     "Body of Lies 7.1" -> {2008, 58444/(4.79 128), 7.1},
     "Revolutionary Road 7.3" -> {2008, 45111/(4.79 119), 7.3},
     "Shutter Island 8.1" -> {2010, 55156/(4.79 138), 8.1},
     "Inception 8.8" -> {2010, 61824/(4.79 148), 8.8},
     "J. Edgar 6.6" -> {2011, 75697/(4.79 137), 6.6},
     "The Revenant 8.2" -> {2016, 29250/(4.79 156), 8.2},
     "Django Unchained 8.5" -> {2012, 68719/(4.79 165), 8.5},
     "The Great Gatsby 7.3" -> {2013, 67586/(4.79 143), 7.3},
     "The Wolf of Wall Street 8.2" -> {2013, 128803/(4.79 180), 8.2}
     |>,
    ChartLabels -> Automatic,
    ChartStyle -> {RGBColor[0.915, 0.3325, 0.2125, 0.89], RGBColor[
      1, 0.75, 0, 0.92], RGBColor[
      0.28026441037696703`, 0.715, 0.4292089322474965, 0.93], 
      RGBColor[0.363898, 0.618501, 0.782349], RGBColor[
      0.40082222609352647`, 0.5220066643438841, 0.85, 0.92], RGBColor[
      0.528488, 0.470624, 0.701351, 0.9], RGBColor[1, 0.58, 0, 0.89], 
      RGBColor[0.647624, 0.37816, 0.614037, 0.9400000000000001], 
      RGBColor[0.40082222609352647`, 0.75, 0.85, 0.87], RGBColor[
      0.736782672705901, 0.358, 0.5030266573755369, 
       0.9400000000000001], RGBColor[
      0.8200000000000001, 0.09, 0.89, 0.81]},
    PlotTheme -> "Detailed",
    ImageSize -> 1300,
    BubbleSizes -> {0.05, 0.3},
    Background -> RGBColor[0.29, 0.29, 0.29, 0.97],
    FrameLabel -> {Style["Year", 24, Bold], 
      Style["words / minute", 24, Bold]},
    LabelStyle -> {RGBColor[0.93, 0.93, 0.93, 0.97], 14, 
      FontFamily -> "Helvetica", Bold}], 
   Style["Leonardo DiCaprio Movies", 30, RGBColor[1, 0.86, 0, 0.85], 
    FontFamily -> "Helvetic", Bold], Top], FrameMargins -> 25, 
  Background -> RGBColor[0.27, 0.27, 0.27, 0.97]]
3 Replies
Posted 8 years ago

it should be: script = Flatten[Table[Table[i, {i, blocks[[j]] + 3, blocks[[j + 1]] - 1}], {j, Length[blocks] - 1}]]; you lost the 'Table'

POSTED BY: eru era
Posted 8 years ago

That's a good example of great visual data , awesome work Jofre :)

POSTED BY: Thomas Eli

Alternatively, we can focus on the script of "The Revenant Movie" (I got the script from the internet Movie Script Database (IMSDb)). We can use the SmoothHistogram in order to compute the probability density function of names over the movie and visualize when they are playing an important role. For example, we can do that with the four main roles:

SmoothHistogram[{
Legended[First /@ StringPosition[ToLowerCase[scriptString], "glass"],Style["Hugh Glass (Leonardo DiCaprio)", Bold, 14]],
Legended[First /@ StringPosition[ToLowerCase[scriptString], "henry"], Style["Captain Andrew Henry (Domhnall Gleeson)", Bold, 14]],
Legended[First /@ StringPosition[ToLowerCase[scriptString], "fitzgerald"],Style["John Fitzgerald (Tom Hardy)", Bold, 14]],
Legended[First /@ StringPosition[ToLowerCase[scriptString], "bridger"],Style["Jim Bridger (Will Poulter)", Bold, 14]]}, 
AxesLabel -> {Style["number of letters", Bold, 14],Style["probability density function", Bold, 14]},
Filling -> Axis, 
ImageSize -> Large]

enter image description here

See also the probability density function of other groups of words (spoiler alert): enter image description here enter image description here enter image description here enter image description here

I hope you found this interesting!

Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract