Message Boards Message Boards

Use Machine Learning to find the best movie ever!

Hi everyone,

in one of the recent discussions I used the powerful Classify function of Mathematica 10 to try to make up for my lack of knowledge of fruits. It appears that the Classify function has quite some nice applications, particularly when combined with data scraped from the internet. I then remembered this beautiful Wolfram Blog article by Matthias Odisio about finding a good sequence of movies to watch. I believe that his approach is based on the covers/poster of the movies. I thought that now with Mathematica 10 we might try that again using a bit more information and very little (!) programming, Actually it will work in one line:

c = Classify[Table[Flatten[data[[k]]] -> ret[[k]], {k, 1, 150}]]

WARNING: If you evaluate the commands you are effectively downloading many websites automatically. Make sure that you do not violate the terms and conditions. Take this post as a description of the principle idea of how do to such things! If you run the code, check that you don't infringe any terms and conditions.

Let's start with getting some data. The IMDb has a nice website with a ranking of the 250 best films ever. Let's import the ranking:

ranking = Import["http://www.imdb.com/chart/top", "Data"][[4 ;; 253, 2]];

and list the top 10:

Grid[ranking[[1 ;; 10]], Frame -> All]

enter image description here

It is quite straight forward to get the hyperlinks to the movies like so:

links = Import["http://www.imdb.com/chart/top", "Hyperlinks"];

This command cleans that list up and only gives links to the films:

moviesite = DeleteDuplicates[Select[links, StringMatchQ[#, "http://www.imdb.com/title/" ~~ ___ ~~ "ref_=chttp_tt" ~~ ___] &]];

The results (well the first ten of them) look like this:

enter image description here

The next part needs a bit of work. We need to identify the important pieces of information on those websites. I am interested in: poster/cover image, title and year, director, lead actors, and short synopsis. The following slightly lengthly and clumsy code does the trick:

data = Monitor[
  Reap[For[i = 1, i <= Length[moviesite], i++, 
     site = Import[moviesite[[i]], "Text"]; 
     Sow[{Import[moviesite[[i]], "Images"][[4]], 
       StringDrop[
        StringDrop[
         StringSplit[
            StringSplit[
               StringSplit[site, " <meta name=\"title\" content="], 
               "<meta name=\"keywords\" content=\"Reviews, Showtimes, \
DVDs, Photos, Message Boards, User Ratings, Synopsis, Trailers, \
Credits\>"][[2]][[1]], "/\>"][[1 ;; 2]][[1]], -9], 1],  
       StringDrop[
        StringDrop[
         StringDrop[
          StringSplit[
             StringSplit[
                StringSplit[site, " <meta name=\"title\" content="], 
                "<meta name=\"keywords\" content=\"Reviews, \
Showtimes, DVDs, Photos, Message Boards, User Ratings, Synopsis, \
Trailers, Credits\>"][[2]][[1]], "/\>"][[1 ;; 2]][[2]], 
          44], -Length[
           Characters[
             StringSplit[
              StringDrop[
               StringSplit[
                  StringSplit[
                    StringSplit[site, 
                    " <meta name=\"title\" content="], 
                    "<meta name=\"keywords\" content=\"Reviews, \
Showtimes, DVDs, Photos, Message Boards, User Ratings, Synopsis, \
Trailers, Credits\>"][[2]][[1]], "/\>"][[1 ;; 2]][[2]], 44], 
              "Directed by" ~~ ___ ~~ ".  With"]][[1]]]], -6],
       StringTake[
        StringSplit[
         StringDrop[
          StringSplit[
             StringSplit[
                StringSplit[site, " <meta name=\"title\" content="], 
                "<meta name=\"keywords\" content=\"Reviews, \
Showtimes, DVDs, Photos, Message Boards, User Ratings, Synopsis, \
Trailers, Credits\>"][[2]][[1]], "/\>"][[1 ;; 2]][[2]], 44], 
         "Directed by" ~~ ___ ~~ ".  With"], 
        Flatten[StringPosition[
           StringSplit[
            StringDrop[
             StringSplit[
                StringSplit[
                   StringSplit[site, 
                    " <meta name=\"title\" content="], 
                   "<meta name=\"keywords\" content=\"Reviews, \
Showtimes, DVDs, Photos, Message Boards, User Ratings, Synopsis, \
Trailers, Credits\>"][[2]][[1]], "/\>"][[1 ;; 2]][[2]], 44], 
            "Directed by" ~~ ___ ~~ ".  With"], "."]][[1]]],
       StringDrop[
        StringDrop[
         StringSplit[
          StringDrop[
           StringSplit[
              StringSplit[
                 StringSplit[site, " <meta name=\"title\" content="], 
                 "<meta name=\"keywords\" content=\"Reviews, \
Showtimes, DVDs, Photos, Message Boards, User Ratings, Synopsis, \
Trailers, Credits\>"][[2]][[1]], "/\>"][[1 ;; 2]][[2]], 44], 
          "Directed by" ~~ ___ ~~ ".  With"], 
         Flatten[StringPosition[
            StringSplit[
             StringDrop[
              StringSplit[
                 StringSplit[
                    StringSplit[site, 
                    " <meta name=\"title\" content="], 
                    "<meta name=\"keywords\" content=\"Reviews, \
Showtimes, DVDs, Photos, Message Boards, User Ratings, Synopsis, \
Trailers, Credits\>"][[2]][[1]], "/\>"][[1 ;; 2]][[2]], 44], 
             "Directed by" ~~ ___ ~~ ".  With"], "."]][[1]]], -1]
       }]]][[2, 1]], i]

A typical entry looks like this:

enter image description here

Downloading all info can take 30-40 minutes. Ok. Now comes a little bit of work. I have programmed a little (clumsy) interface where I can rate the first 150 movies - the training set.

ret = {};
For[i = 1, i <= 150, i++,  ret = AppendTo[ret, ChoiceDialog[{ImageResize[data[[i, 1]], 100], data[[i, 2]], data[[i, 3]], data[[i, 4]], data[[i, 5]]} // 
TableForm, {"bad" -> "bad", "good" -> "good"}, WindowSize -> Large]]]

It is primitive, and I would be more than happy if people want to improve it. The interface looks like this:

enter image description here

It shows all relevant information though very badly formatted. The good/bad rating is saved in the variable ret. We can finally train our Classify function

c = Classify[Table[Flatten[data[[k]]] -> ret[[k]], {k, 1, 150}]]

It uses NearestNeighbors, and seems to work just fine. We can now see what classify thinks I like or dislike.

RandomChoice[Table[{data[[k, 1]], data[[k, 2]], c[Flatten[data[[k]]], "Probabilities"]}, {k, 151, 250}], 1] // TableForm

which gives:

enter image description here

So I apparently do like Jaws more than I dislike it. Interesting. Of course, if you have the data you can now check large numbers of films at the click of a button:

Table[{data[[k, 1]], data[[k, 2]], c[Flatten[data[[k]]], "Probabilities"]}, {k, 151, 250}] // TableForm

It is easy to train the system better if you have better data on what you like or dislike. Obviously, Netflix and Amazon will use similar things to advertise something you might like to you. The amazing thing for me is that, once you have the database, all that can be done with one line of code:

c = Classify[Table[Flatten[data[[k]]] -> ret[[k]], {k, 1, 150}]]

I will attach a notebook with all commands.

Disclaimer: I do not know whether you breach the Terms and Conditions of IMDb. You run this command on your own risk.

Hope you enjoy the results and the power of Classify[],

Marco

Attachments:
POSTED BY: Marco Thiel
2 Replies

Holy guacamole (sorry, that was just a cheap excuse to evoke an avocado).

enter image description here

POSTED BY: Daniel Lichtblau

Interestingly a student at this year's Wolfram Science Summer School used a similar set of tools to--via machine learning and Mathematica's classify function--create a predictor of a film's profitability (and using a similar variety of training data that you made use of). She then created a cloud API to produce a web application using the approach.

POSTED BY: David Reiss
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract