Group Abstract

Message Boards

WOLFRAM COMMUNITY

18.1K Views

3 Replies

16 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Visualizing data of Leonardo DiCaprio movies

Jofre Espigule-Pons

Jofre Espigule-Pons, Wolfram Research

Posted 10 years ago

In this post I show how to create this BubbleChart representing the average words/minute and rating of the movie (the bigger the diameter of the bubble is, the better is the movie) and also show a simple way of building story flow visualization with SmoothHistogram. For the story flow you can take a look at this short tutorial Story Flow in Alice or even try coding it in a few lines in open cloud. Finally last Sunday Leonardo DiCaprio won his first Oscar with "The Revenant". He did many other great movies before, but the golden statuette always escaped from his hands. Here I will try to visualize his career trajectory by comparing his movies. In order to compare his movies I will use three variables for each movie: the year of its release, the average words/minute and its rating (taken from IMDb). The first thing that we need to do is to import the script of each movie (you can download the english subtitles in many sites). Let's import the ones for "Inception": SetDirectory[NotebookDirectory[]]; srt = Import["inception-yify-english.srt", "Text"]; The format of the script is the following: "21 00:03:15,486 --> 00:03:18,447 Yes. In the dream state, your conscious defenses are lowered... 22 00:03:18,615 --> 00:03:21,033 ...and that makes your thoughts vulnerable to theft." So, we need to do some text processing in order to get a string with all words only: partitions = StringSplit[srt, "\n"]; blocks = Flatten[Position[StringSplit[srt, "\n"], ""]]; script = Flatten[Table[Table[i, {i, blocks[[j]] + 3, blocks[[j + 1]] - 1}], {j, Length[blocks] - 1}]]; stringsrt = StringDelete[StringJoin[Flatten[Map[partitions[[#]] &, script]]], {Shortest[ "[" ~~ x___ ~~ "]"]}]; Now, we can do for example a WordCloud of the movie: WordCloud[DeleteStopwords[stringsrt], ScalingFunctions -> (#^.1 &)] But what we actually want is the StringLength of the script, so we can estimate the average words/minute later on: StringLength[stringsrt] 68719 Repeating the same for the rest of the movies, we can finally create the BubbleChart that appears on top of the page: bubbleChart = Framed[Labeled[BubbleChart[<\|"This Boy's Life 7.3" -> {1993, 45901/(4.79 115), 7.3}, "What's Eating Gilbert Grape 7.8" -> {1993, 45000/(4.79 118), 7.8}, "The Basketball Diaries 7.3" -> {1995, 45996/(4.79 102), 7.3}, "Total Eclipse 6.5" -> {1995, 29078/(4.79 111), 6.5}, "Romeo+Juliet 6.8" -> {1996, 41022/(4.79 120), 6.8}, "Titanic 7.7" -> {1997, 64681/(4.79 194), 7.7}, "The Man in the Iron Mask 6.4" -> {1998, 34450/(4.79 132), 6.4}, "The Beach 6.6" -> {2000, 42965/(4.79 119), 6.6}, "Gangs of New York 7.5" -> {2002, 53011/(4.79 167), 7.5}, "Catch Me If You Can 8.0" -> {2002, 62683/(4.79 141), 8.0}, "The Aviator 7.5" -> {2004, 83739/(4.79 170), 7.5}, "The Departed 8.5" -> {2006, 74151/(4.79 151), 8.5}, "Blood Diamond 8.0" -> {2006, 49529/(4.79 143), 8.0}, "Body of Lies 7.1" -> {2008, 58444/(4.79 128), 7.1}, "Revolutionary Road 7.3" -> {2008, 45111/(4.79 119), 7.3}, "Shutter Island 8.1" -> {2010, 55156/(4.79 138), 8.1}, "Inception 8.8" -> {2010, 61824/(4.79 148), 8.8}, "J. Edgar 6.6" -> {2011, 75697/(4.79 137), 6.6}, "The Revenant 8.2" -> {2016, 29250/(4.79 156), 8.2}, "Django Unchained 8.5" -> {2012, 68719/(4.79 165), 8.5}, "The Great Gatsby 7.3" -> {2013, 67586/(4.79 143), 7.3}, "The Wolf of Wall Street 8.2" -> {2013, 128803/(4.79 180), 8.2} \|>, ChartLabels -> Automatic, ChartStyle -> {RGBColor[0.915, 0.3325, 0.2125, 0.89], RGBColor[ 1, 0.75, 0, 0.92], RGBColor[ 0.28026441037696703`, 0.715, 0.4292089322474965, 0.93], RGBColor[0.363898, 0.618501, 0.782349], RGBColor[ 0.40082222609352647`, 0.5220066643438841, 0.85, 0.92], RGBColor[ 0.528488, 0.470624, 0.701351, 0.9], RGBColor[1, 0.58, 0, 0.89], RGBColor[0.647624, 0.37816, 0.614037, 0.9400000000000001], RGBColor[0.40082222609352647`, 0.75, 0.85, 0.87], RGBColor[ 0.736782672705901, 0.358, 0.5030266573755369, 0.9400000000000001], RGBColor[ 0.8200000000000001, 0.09, 0.89, 0.81]}, PlotTheme -> "Detailed", ImageSize -> 1300, BubbleSizes -> {0.05, 0.3}, Background -> RGBColor[0.29, 0.29, 0.29, 0.97], FrameLabel -> {Style["Year", 24, Bold], Style["words / minute", 24, Bold]}, LabelStyle -> {RGBColor[0.93, 0.93, 0.93, 0.97], 14, FontFamily -> "Helvetica", Bold}], Style["Leonardo DiCaprio Movies", 30, RGBColor[1, 0.86, 0, 0.85], FontFamily -> "Helvetic", Bold], Top], FrameMargins -> 25, Background -> RGBColor[0.27, 0.27, 0.27, 0.97]]

In this post I show how to create this BubbleChart representing the average words/minute and rating of the movie (the bigger the diameter of the bubble is, the better is the movie) and also show a simple way of building story flow visualization with SmoothHistogram. For the story flow you can take a look at this short tutorial Story Flow in Alice or even try coding it in a few lines in open cloud.

enter image description here

Finally last Sunday Leonardo DiCaprio won his first Oscar with "The Revenant". He did many other great movies before, but the golden statuette always escaped from his hands. Here I will try to visualize his career trajectory by comparing his movies.

In order to compare his movies I will use three variables for each movie: the year of its release, the average words/minute and its rating (taken from IMDb).

The first thing that we need to do is to import the script of each movie (you can download the english subtitles in many sites). Let's import the ones for "Inception":

SetDirectory[NotebookDirectory[]];    
srt = Import["inception-yify-english.srt", "Text"];

The format of the script is the following:

"21
00:03:15,486 --> 00:03:18,447
Yes. In the dream state,
your conscious defenses are lowered...

22
00:03:18,615 --> 00:03:21,033
...and that makes your thoughts
vulnerable to theft."

So, we need to do some text processing in order to get a string with all words only:

partitions = StringSplit[srt, "\n"];
blocks = Flatten[Position[StringSplit[srt, "\n"], ""]];
script = Flatten[Table[Table[i, {i, blocks[[j]] + 3, blocks[[j + 1]] - 1}], {j, Length[blocks] - 1}]];
stringsrt =  StringDelete[StringJoin[Flatten[Map[partitions[[#]] &, script]]], {Shortest[  "[" ~~ x___ ~~ "]"]}];

Now, we can do for example a WordCloud of the movie:

WordCloud[DeleteStopwords[stringsrt], ScalingFunctions -> (#^.1 &)]

enter image description here

But what we actually want is the StringLength of the script, so we can estimate the average words/minute later on:

StringLength[stringsrt]
68719

Repeating the same for the rest of the movies, we can finally create the BubbleChart that appears on top of the page:

bubbleChart = Framed[Labeled[BubbleChart[<|"This Boy's Life 7.3" -> {1993, 45901/(4.79 115), 7.3},
     "What's Eating Gilbert Grape 7.8" -> {1993, 45000/(4.79 118), 7.8},
     "The Basketball Diaries 7.3" -> {1995, 45996/(4.79 102), 7.3},
     "Total Eclipse 6.5" -> {1995, 29078/(4.79 111), 6.5},
     "Romeo+Juliet 6.8" -> {1996, 41022/(4.79 120), 6.8},
     "Titanic 7.7" -> {1997, 64681/(4.79 194), 7.7},
     "The Man in the Iron Mask 6.4" -> {1998, 34450/(4.79 132), 6.4},
     "The Beach 6.6" -> {2000, 42965/(4.79 119), 6.6},
     "Gangs of New York 7.5" -> {2002, 53011/(4.79 167), 7.5},
     "Catch Me If You Can 8.0" -> {2002, 62683/(4.79 141), 8.0},
     "The Aviator 7.5" -> {2004, 83739/(4.79 170), 7.5},
     "The Departed 8.5" -> {2006, 74151/(4.79 151), 8.5},
     "Blood Diamond 8.0" -> {2006, 49529/(4.79 143), 8.0},
     "Body of Lies 7.1" -> {2008, 58444/(4.79 128), 7.1},
     "Revolutionary Road 7.3" -> {2008, 45111/(4.79 119), 7.3},
     "Shutter Island 8.1" -> {2010, 55156/(4.79 138), 8.1},
     "Inception 8.8" -> {2010, 61824/(4.79 148), 8.8},
     "J. Edgar 6.6" -> {2011, 75697/(4.79 137), 6.6},
     "The Revenant 8.2" -> {2016, 29250/(4.79 156), 8.2},
     "Django Unchained 8.5" -> {2012, 68719/(4.79 165), 8.5},
     "The Great Gatsby 7.3" -> {2013, 67586/(4.79 143), 7.3},
     "The Wolf of Wall Street 8.2" -> {2013, 128803/(4.79 180), 8.2}
     |>,
    ChartLabels -> Automatic,
    ChartStyle -> {RGBColor[0.915, 0.3325, 0.2125, 0.89], RGBColor[
      1, 0.75, 0, 0.92], RGBColor[
      0.28026441037696703`, 0.715, 0.4292089322474965, 0.93], 
      RGBColor[0.363898, 0.618501, 0.782349], RGBColor[
      0.40082222609352647`, 0.5220066643438841, 0.85, 0.92], RGBColor[
      0.528488, 0.470624, 0.701351, 0.9], RGBColor[1, 0.58, 0, 0.89], 
      RGBColor[0.647624, 0.37816, 0.614037, 0.9400000000000001], 
      RGBColor[0.40082222609352647`, 0.75, 0.85, 0.87], RGBColor[
      0.736782672705901, 0.358, 0.5030266573755369, 
       0.9400000000000001], RGBColor[
      0.8200000000000001, 0.09, 0.89, 0.81]},
    PlotTheme -> "Detailed",
    ImageSize -> 1300,
    BubbleSizes -> {0.05, 0.3},
    Background -> RGBColor[0.29, 0.29, 0.29, 0.97],
    FrameLabel -> {Style["Year", 24, Bold], 
      Style["words / minute", 24, Bold]},
    LabelStyle -> {RGBColor[0.93, 0.93, 0.93, 0.97], 14, 
      FontFamily -> "Helvetica", Bold}], 
   Style["Leonardo DiCaprio Movies", 30, RGBColor[1, 0.86, 0, 0.85], 
    FontFamily -> "Helvetic", Bold], Top], FrameMargins -> 25, 
  Background -> RGBColor[0.27, 0.27, 0.27, 0.97]]

POSTED BY: Jofre Espigule-Pons

3 Replies

Sort By:

eru era

Posted 10 years ago

it should be: script = Flatten[Table[Table[i, {i, blocks[[j]] + 3, blocks[[j + 1]] - 1}], {j, Length[blocks] - 1}]]; you lost the 'Table'

POSTED BY: eru era

Thomas Eli

Posted 10 years ago

That's a good example of great visual data , awesome work Jofre :)

POSTED BY: Thomas Eli

Jofre Espigule-Pons

Jofre Espigule-Pons, Wolfram Research

Posted 10 years ago

Alternatively, we can focus on the script of "The Revenant Movie" (I got the script from the internet Movie Script Database (IMSDb)). We can use the SmoothHistogram in order to compute the probability density function of names over the movie and visualize when they are playing an important role. For example, we can do that with the four main roles: SmoothHistogram[{ Legended[First /@ StringPosition[ToLowerCase[scriptString], "glass"],Style["Hugh Glass (Leonardo DiCaprio)", Bold, 14]], Legended[First /@ StringPosition[ToLowerCase[scriptString], "henry"], Style["Captain Andrew Henry (Domhnall Gleeson)", Bold, 14]], Legended[First /@ StringPosition[ToLowerCase[scriptString], "fitzgerald"],Style["John Fitzgerald (Tom Hardy)", Bold, 14]], Legended[First /@ StringPosition[ToLowerCase[scriptString], "bridger"],Style["Jim Bridger (Will Poulter)", Bold, 14]]}, AxesLabel -> {Style["number of letters", Bold, 14],Style["probability density function", Bold, 14]}, Filling -> Axis, ImageSize -> Large] See also the probability density function of other groups of words (spoiler alert): I hope you found this interesting!

Alternatively, we can focus on the script of "The Revenant Movie" (I got the script from the internet Movie Script Database (IMSDb)). We can use the SmoothHistogram in order to compute the probability density function of names over the movie and visualize when they are playing an important role. For example, we can do that with the four main roles:

SmoothHistogram[{
Legended[First /@ StringPosition[ToLowerCase[scriptString], "glass"],Style["Hugh Glass (Leonardo DiCaprio)", Bold, 14]],
Legended[First /@ StringPosition[ToLowerCase[scriptString], "henry"], Style["Captain Andrew Henry (Domhnall Gleeson)", Bold, 14]],
Legended[First /@ StringPosition[ToLowerCase[scriptString], "fitzgerald"],Style["John Fitzgerald (Tom Hardy)", Bold, 14]],
Legended[First /@ StringPosition[ToLowerCase[scriptString], "bridger"],Style["Jim Bridger (Will Poulter)", Bold, 14]]}, 
AxesLabel -> {Style["number of letters", Bold, 14],Style["probability density function", Bold, 14]},
Filling -> Axis, 
ImageSize -> Large]

enter image description here

See also the probability density function of other groups of words (spoiler alert): enter image description here

I hope you found this interesting!

POSTED BY: Jofre Espigule-Pons

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback