
Analyzing a Dataset of Game Releases

Hi, I'm Rob Lockhart, Creative Director of Important Little Games. I'd be grateful if you followed me on Twitter.

It all started when I stumbled across this misleadingly titled Polygon article written last year and followed the link to the data source out of curiosity. Basically, it's just a list of videogame titles, some of which have been annotated with a developer, a year, and/or a platform. Since I'm fond of semi-structured data sources, I downloaded the list, which had grown to nearly 150,000 titles since the Polygon article was published, and started to play around in Wolfram Language. As you read on, be advised that this is an extremely noisy dataset and does not necessarily reflect the videogame industry's history, or even the titles it lists. Here is the import and initial cleaning:

gameliststring = Import["~/Documents/list_of_every_video_game_ever_(v3).txt"];

gamelist = (StringTrim[#, 
       RegularExpression["[\\s\\(\\)]*"]] & /@ (StringSplit[#, 
        RegularExpression["\\)?\\s*\\("]] &)) /@ 
   StringSplit[gameliststring, "\n"][[4 ;;]];

Length[gamelist]

149665
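
To make the cleaning step concrete, here is a minimal sketch that applies the same two regular expressions to a single made-up line. The title, developer, and platform are hypothetical, the names sampleLine and parseLine are mine, and I am assuming the raw lines look roughly like "Title (tag) (tag)".

```wolfram
(* Hypothetical raw line; the real file's exact layout is an assumption *)
sampleLine = "Example Quest (Acme Games, 2004) (PS2)";

(* Split at each opening parenthesis (optionally preceded by ")" and whitespace),
   then trim leftover whitespace and parentheses from both ends of every field *)
parseLine[line_String] := StringTrim[
   StringSplit[line, RegularExpression["\\)?\\s*\\("]],
   RegularExpression["[\\s\\(\\)]*"]];

parsed = parseLine[sampleLine]
```

With this input the result should be the three fields "Example Quest", "Acme Games, 2004", and "PS2": the first element is the title, and everything after it is an unlabelled tag.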

The first thing I did was take a look at the top words that occur in videogame titles. There were nearly 150,000 game titles and a vocabulary of around 45,000 unique words. Nearly 22,000 of these were used only once across all the titles. For scale, consider that a native speaker's whole vocabulary is apparently often around 20,000-35,000 words.

Let's take a look at the top 50 words I found:

titlewords = SortBy[Tally[Flatten[StringSplit[ToLowerCase[#], 
        RegularExpression@"[^\\w\\']+"] & /@ gamelist[[All, 1]]]], 1/#[[2]] &];

BarChart[titlewords[[;; 50, 2]], 
 ChartLabels -> Placed[titlewords[[;; 50, 1]], {{0.5, 0}, {0.9, 1}}, 
   Rotate[#, (2/7) Pi] &], ChartStyle -> 24, ImageSize -> {800, 500}]

[Image: bar chart of the 50 most frequent words in game titles]

There are a lot of words that are completely unsurprising, as they are overwhelmingly frequent throughout English. Numerals, both Arabic and Roman, play a big role, meaning that there are a lot of sequels. Frustrating for those of us who value originality in interactive entertainment, but by no means surprising. Let's filter out these uninteresting results and look again:

nontrivial = 
  SortBy[ReplaceRepeated[
    DeleteCases[
     titlewords, {"vol" | "the" | "and" | "a" | "of" | "in" | "no" | 
       "to" | "for" | "is" | "1" | "10" | "13" | 
       Alternatives @@ (ToString /@ Range[2, 5]) | "ii" | "iii" | 
       Alternatives @@ (ToString /@ Range[2000, 2016]) | 
       Alternatives @@ ("0" <> ToString[#] & /@ 
          Range[5, 9]), _}], {a___, {b_, c_}, d___, {e_, f_}, g___} /;
       StringMatchQ[e, b ~~ "s" | "es"] :> {a, {b, c + f}, d, g}, 
    MaxIterations -> 25], 1/#[[2]] &];

BarChart[nontrivial[[;; 50, 2]], 
 ChartLabels -> 
  Placed[nontrivial[[;; 50, 1]], {{0.5, 0}, {0.9, 1}}, 
   Rotate[#, (2/7) Pi] &], ChartStyle -> 2, ImageSize -> {800, 500}]

[Image: bar chart of the 50 most frequent words after filtering]

Length[Cases[titlewords, {_, 1}]]

21797

Length[titlewords]

44577

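In the filtering step above, I also merged plurals into their root word: a word's count is added to an earlier (more frequent) entry when the word equals that entry plus "s" or "es".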
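
For readers who want a different way to merge plurals, here is a minimal sketch that uses an Association lookup instead of the ReplaceRepeated rule. It is not the method used above: it merges a plural into its root regardless of which is more frequent, and it is not limited by MaxIterations. The sample tally and the names tally, counts, stem, and merged are made up for illustration.

```wolfram
(* Hypothetical tally in the same {word, count} form as titlewords *)
tally = {{"game", 10}, {"war", 7}, {"games", 4}, {"wars", 2},
   {"boxes", 1}, {"box", 3}, {"chess", 2}};
counts = Association[Rule @@@ tally];

(* Map a word to its root only if the root itself occurs in the tally *)
stem[w_String] := Which[
   StringEndsQ[w, "es"] && KeyExistsQ[counts, StringDrop[w, -2]], StringDrop[w, -2],
   StringEndsQ[w, "s"] && KeyExistsQ[counts, StringDrop[w, -1]], StringDrop[w, -1],
   True, w];

(* Add up the counts of all words that share a root, largest first *)
merged = Reverse[Sort[Merge[KeyValueMap[stem[#1] -> #2 &, counts], Total]]]
```

Here "games" is folded into "game" (14) and "boxes" into "box" (4), while "chess" is left alone because neither "ches" nor "ch" appears in the tally.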

In my humble opinion, it really sucks that 'war' shows up second, after 'game.' There's nothing wrong with war as a theme for any particular game, but our industry's singular focus on war and violence becomes pretty tiresome, as this chart exemplifies. Which word would I prefer in second place? 'Magic,' of course!

I also noticed that quite a lot of games use subtitles. Not the written dialogue at the bottom of the cutscenes, but the second part of a title, separated by a colon. Things like the part after the colon in "Call of Warfare: Modern Videogame." Let's take a look at the most common subtitles:

subtitles = 
  SortBy[Tally[((StringSplit[#, 
            ": "] /. {a_} :> {"No Subtitle"})[[-1]] & /@ 
       gamelist[[All, 1]])], 1/#[[2]] &][[;; 500]];

BarChart[subtitles[[2 ;; 50, 2]], 
  ChartLabels -> 
   Placed[subtitles[[2 ;; 50, 1]], {{0.5, 0}, {0.9, 1}}, 
    Rotate[#, (2/7) Pi] &] , ChartStyle -> "Rainbow"] /. 
 ImageScaled[{1/2, 1}] -> ImageScaled[{0.9, 1}]

[Image: bar chart of the most common subtitles]

'The Game' and 'Gold Edition' seem to make sense, but for some reason 'The Movie' comes in third. Why are there so many games (56) with ': The Movie' in the title?!

I'm not very fond of this naming pattern in the first place, but some of these should unquestionably be retired. Let's not name any more games "Something Something: Vengeance," shall we?

As I mentioned earlier, some of the entries in the data are tagged with a developer, year, and/or platform. I found the developers more or less impossible to extract systematically, but I had better luck with years and platforms.

About 1/5 of the games were tagged with a year, but they were represented unevenly. As you can see below, only the years from 2000 to 2014 had any kind of decent coverage (2015 is only partially covered). It's interesting to note that within that period, the number of games released per year did not increase or decrease significantly (if this dataset can be taken as a representative sample).

(* The two year alternatives are grouped so that the surrounding
   [\w\s,]* applies to both 19xx and 20xx years *)
withYear = #[[1, 2]] -> #[[All, 1]] & /@ 
   GatherBy[
    Cases[gamelist, {name_, ___, 
       a_String /; 
        StringMatchQ[a, 
         RegularExpression[
          "[\\w\\s,]*(?:19[789][0-9]|20[01][0-9])[\\w\\s,]*"]], ___} :> {name, 
       ToExpression[
        StringCases[a, 
          RegularExpression@"(?:19[789][0-9])|(?:20[01][0-9])"][[
         1]]]}], #[[2]] &];

N[Length[Join @@ withYear[[All, 2]]]/Length[gamelist]]

0.200889
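
As a small illustration of the year-extraction idea, here is a sketch that pulls a year out of a single tag string. The tag strings and the names yearPattern and extractYear are hypothetical; I have also added word boundaries (\b) so that a year embedded in a longer number is not picked up, which is my choice and not part of the code above.

```wolfram
(* Years 1970-1999 or 2000-2019, as a whole word *)
yearPattern = RegularExpression["\\b(?:19[789][0-9]|20[01][0-9])\\b"];

(* Return the first year found in a tag, or Missing if there is none *)
extractYear[tag_String] := With[{m = StringCases[tag, yearPattern]},
   If[m === {}, Missing["NoYear"], FromDigits[First[m]]]];

(* Hypothetical tags: developer plus year, platform only, bare year *)
years = extractYear /@ {"Acme Games, 2004", "PS2", "1987"}
```

The second tag contains no year, so it comes back as Missing["NoYear"] and can be dropped with DeleteMissing.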

numPerYear = 
 Thread[{Range[1984, 2016], 
   Replace[(Range[1984, 2016] /. withYear), {a_List :> Length[a], 
     b_?NumericQ :> 0}, 1]}]

{{1984, 1}, {1985, 0}, {1986, 0}, {1987, 0}, {1988, 1}, {1989, 0}, {1990, 2}, {1991, 3}, {1992, 0}, {1993, 0}, {1994, 1}, {1995, 0}, {1996, 1}, {1997, 0}, {1998, 1}, {1999, 1}, {2000, 2161}, {2001, 1964}, {2002, 2002}, {2003, 1827}, {2004, 1687}, {2005, 1885}, {2006, 1821}, {2007, 1854}, {2008, 1860}, {2009, 2351}, {2010, 2160}, {2011, 1982}, {2012, 2308}, {2013, 2075}, {2014, 1740}, {2015, 375}, {2016, 3}}
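
The same per-year table can also be built with Lookup and a default value, which avoids the two-step replacement. This is only a sketch on made-up data; sampleByYear stands in for withYear and the titles are hypothetical.

```wolfram
(* Hypothetical stand-in for withYear: year -> list of titles *)
sampleByYear = {2000 -> {"Example Quest", "Example Kart"},
   2002 -> {"Example Chess"}};

(* Lookup falls back to {} for years with no entry, so those count as 0 *)
countsPerYear = Table[{y, Length[Lookup[sampleByYear, y, {}]]}, {y, 1999, 2003}]
```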

BarChart[numPerYear[[All, 2]], 
 ChartLabels -> 
  Placed[numPerYear[[All, 1]], {{0.5, 0}, {0.9, 1}}, 
   Rotate[#, (2/7) Pi] &]]

[Image: bar chart of the number of year-tagged games per year, 1984-2016]

If we compile a list of the top ten words for each of these usable years, we might notice some trends.

Grid[Prepend[
  SortBy[Prepend[
       With[{sorted = 
          DeleteCases[
           SortBy[Tally[
             Flatten[
              StringSplit[ToLowerCase[#], 
                 RegularExpression@"[^\\w\\']+"] & /@ #[[2]]]], 
            1/#[[2]] &], {"vol" | "the" | "and" | "a" | "of" | "in" | 
             "no" | "to" | "for" | "is" | "1" | "10" | "13" | 
             Alternatives @@ (ToString /@ Range[2, 5]) | "ii" | 
             "iii" | Alternatives @@ (ToString /@ Range[2000, 2016]) |
              Alternatives @@ ("0" <> ToString[#] & /@ 
                Range[5, 9]), _}]}, 
        If[Length[sorted] >= 10, sorted[[;; 10]][[All, 1]], 
         sorted[[All, 1]]]], #[[1]]] & /@ withYear, #[[1]] &][[9 ;; -2]], 
{"Year", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}]]

[Image: table of the top ten title words for each year]

I think you can kind of see the zombie craze creeping up in the past few years, as the words 'dark,' 'night,' and 'dead' climb the charts. You can also see where we became obsessed with 3D for a little while.

If we bring back the trivial words we decided to exclude early on, you'll see that some games' titles include the year they were released and many include the following year.

Grid[Prepend[
  SortBy[Prepend[
       With[{sorted = 
          DeleteCases[
           SortBy[Tally[
             Flatten[
              StringSplit[ToLowerCase[#], 
                 RegularExpression@"[^\\w\\']+"] & /@ #[[2]]]], 
            1/#[[2]] &], {"the" | "and" | "a" | "of" | "in" | "no" | 
             "to" | "for" | "is", _}]}, 
        If[Length[sorted] >= 10, sorted[[;; 10]][[All, 1]], 
         sorted[[All, 1]]]], #[[1]]] & /@ withYear, #[[1]] &][[9 ;; -2]], 
{"Year", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}]]

[Image: table of the top ten title words for each year, with numerals and years included]

In terms of platforms, the coverage was very spotty. Here you can see the number of games tagged with each platform. The fact that Linux has any significant presence should be a clue that some platforms are far overrepresented amongst tagged games.

platforms = {"Windows", "PS2", "PC-9801", "Linux/Unix", "PS1", "PS3",  "PSP", "Mobile", "Arcade", "NES", "Apple IIe", "C64", "X360", 
   "Amstrad CPC", "ZX Spectrum", "Xbox", "GameCube", "Wii U",  "ZX Spectrum 128", "MSX", "TI99", "SG-1000", "GBC", "N64", "iOS", 
   "FM7", "OS/2", "PS Vita", "Amiga", "Nintendo DS", "Wii", "Mac", "EXL 100", "Android", "Thomson", "3DS", "Dreamcast", "Atari ST", 
   "MS-DOS", "GBA", "C128", "Oric", "Chip 8", "PC-8801", "PCE CD/TG-CD", "Mega Drive / Genesis", "VIC-20", "X68000", 
   "Sharp X1", "Atari 8-bit", "Amiga AGA", "Famicom Disk System", "PCE / TurboGrafx", "MSX2", "GP2X", "BBC", "SNES", "Flash", "SMS", 
   "GameKing", "ColecoVision", "Neo Geo", "Apple IIGS", "HP-41", "Astrocade", "C16/Plus4", "Saturn", "BeOS", "Game Boy", 
   "Sega-CD / Mega-CD", "PC-6001", "Atari 2600", "Arcadia 2001", "Win3.1", "Epoc", "Atom", "3DO", "Mattel Aquarius", "Electron", 
   "ZX 81", "Dragon32", "Zeebo", "QL", "Enterprise", "Archimedes", "Virtual Boy", "TRS-80", "MicroBee", "Sharp MZ-700", "CD-i", 
   "FM Towns", "Game Gear", "Lynx", "Internet Only", "Odyssey\.b2", "Tandy", "Pico", "Atari 5200", "BK11M", "Intellivision", 
   "Atari 7800", "DEC PDP-1", "TI Calculators", "32X", "Jaguar", "PC-FX", "Pippin", "PET", "Odyssey\.b3", "Creativision", "PLATO", 
   "WinCE", "DVD player", "Vii", "V.Smile", "WSC", "Supervision", "N-Gage", "Xbox One", "PS4", "GP32", "Vectrex", "VG5000\[Micro]", 
   "SAM", "Atari Falcon", "custom", "Odyssey", "WS", "Mega LD", "R-Zone", "Cassette Vision", "Konix", "Game & Watch", "Sord M5", 
   "NGPC", "Tiger Game.COM", "Palm", "Game Master", "VC 4000", "Leapster", "Fairchild Ch. F", "Collector's Edition", 
   "Special Edition", "Limited Edition", "Gold Edition", "included games", "Shokai Genteiban", "Deluxe Edition", 
   "Limited Collector's Edition", "Genteiban", "Series 2"};

gamesByPlatform = #[[1, -1]] -> #[[All, ;; -2]] & /@ 
   GatherBy[
    Cases[gamelist, {__, Alternatives @@ platforms}], #[[-1]] &];

Length[Join @@ gamesByPlatform[[All, 2]]]

100455

platformTally = 
  SortBy[gamesByPlatform /. {HoldPattern[
       plat_String -> games_List] :> {plat, Length[games]}}, 
   1/#[[2]] &];

BarChart[platformTally[[;; 25, 2]], 
 ChartLabels -> 
  Placed[platformTally[[;; 25, 1]], {{0.5, 0}, {0.9, 1}}, 
   Rotate[#, (2/7) Pi] &], ChartStyle -> "Pastel", AspectRatio -> 1/5]

[Image: bar chart of the 25 platforms with the most tagged games]
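
To show how the platform grouping works on a small scale, here is a sketch that counts games by their last tag. The sample entries and the names sampleGames, knownPlatforms, tagged, and platformCounts are made up; like the code above, it treats only the final field as the platform.

```wolfram
(* Hypothetical entries in the same {title, tags...} form as gamelist *)
sampleGames = {{"Example Quest", "Acme Games, 2004", "PS2"},
   {"Example Kart", "PS2"}, {"Example Chess", "Windows"},
   {"Untagged Game"}, {"Example Golf", "Gold Edition"}};
knownPlatforms = {"PS2", "Windows"};

(* Keep entries with at least one tag whose last tag is a known platform *)
tagged = Cases[sampleGames, {__, Alternatives @@ knownPlatforms}];

(* Count by platform, most common first *)
platformCounts = Reverse[Sort[Counts[Last /@ tagged]]]
```

Because "Gold Edition" is not in knownPlatforms here, that entry is skipped. The full platforms list above does include edition names, which is why they have to be filtered out again later.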

If you're interested, here is a list of the top ten words by platform. Many of these platforms only have one or two titles listed, so you'll see some oddly specific words.

Export["~/Documents/RecreationalProgramming/GamesList/PlatformWords.\
jpg", Grid[
  DeleteCases[
   Prepend[SortBy[
      Prepend[With[{sorted = 
            DeleteCases[
             SortBy[Tally[
               Flatten[
                StringSplit[ToLowerCase[#], 
                   RegularExpression@"[^\\w\\']+"] & /@ #[[2]]]], 
              1/#[[2]] &], {"vol" | "the" | "and" | "a" | "of" | 
               "on" | "in" | "no" | "to" | "for" | "is" | "1" | "10" |
                "13" | Alternatives @@ (ToString /@ Range[2, 5]) | 
               "ii" | "iii" | 
               Alternatives @@ (ToString /@ Range[2000, 2016]) | 
               Alternatives @@ ("0" <> ToString[#] & /@ 
                  Range[5, 9]), _}]}, 
          If[Length[sorted] >= 10, sorted[[;; 10]][[All, 1]], 
           sorted[[All, 1]]]], #[[1]]] & /@ 
       Thread[gamesByPlatform[[All, 1]] -> 
         gamesByPlatform[[All, 2, All, 1]]], #[[1]] &][[
     9 ;; -2]], {"Platform", 1, 2, 3, 4, 5, 6, 7, 8, 9, 
     10}], {"Limited Collector's Edition" | "Collector's Edition" | 
     "Gold Edition" | "included games", __}]]]

[Image: table of the top ten title words for each platform]

Thanks for reading! If you're interested in exploring the dataset yourself, feel free to download my Mathematica notebook, which is attached to this post below. I'd love to hear your suggestions for further analyses to do and other datasets to explore.

Attachments:
POSTED BY: Rob Lockhart
7 Replies

Hi Rob,

this is really nice. I haven't had much time, but I liked this representation:

data = Import["http://pastebin.com/DG1CsVXk", "Data"];
Quiet[names = (StringSplit[#, "("] & /@ data[[2, 2, 3 ;;]][[1 ;;]])[[All, 1]]];
WordCloud[DeleteStopwords@(ToString /@ DeleteCases[Flatten@(TextWords /@ DeleteDuplicates[Select[names, StringQ[#] &]]), {}])]

[Image: word cloud of words in game titles]

It gives an idea of what people are interested in when it comes to games. The Quiet function indicates that I was too lazy to deal with the cleaning of the data properly.

It is easy to generate a word cloud for different periods in time.
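
A minimal sketch of what that could look like, assuming a list of year -> titles rules such as the withYear list from the original post. The sample data and the helper names titlesBetween and periodWords are hypothetical.

```wolfram
(* Hypothetical stand-in for a year -> titles list such as withYear *)
sampleByYear = {2000 -> {"Example Quest", "Dark Example"},
   2001 -> {"Example Kart"},
   2010 -> {"Example Zombies", "Night of the Example"}};

(* Collect all titles whose year falls inside the requested period *)
titlesBetween[byYear_List, from_Integer, to_Integer] := Flatten[
   Cases[byYear, (y_Integer -> titles_List) /; from <= y <= to :> titles]];

(* Lowercased title words for the period, with stopwords removed *)
periodWords[byYear_List, from_Integer, to_Integer] := DeleteStopwords[
   ToLowerCase[Flatten[TextWords /@ titlesBetween[byYear, from, to]]]];

cloud = WordCloud[periodWords[sampleByYear, 2010, 2015]]
```

Calling periodWords with different year ranges gives one word list, and hence one cloud, per period.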

Cheers,

Marco

POSTED BY: Marco Thiel

Dear Rob,

very nice! Thanks for sharing. A small comment: you might want to use DeleteStopwords instead of your hand-written word list in nontrivial.
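
As a rough sketch of that suggestion, on a made-up word list (sampleWords and kept are hypothetical names):

```wolfram
(* Hypothetical lowercased title words *)
sampleWords = {"the", "legend", "of", "war", "and", "war", "ii", "2"};

(* Drop common English function words such as "the", "of", "and" *)
kept = DeleteStopwords[sampleWords];

(* Tally what is left, most frequent first *)
wordCounts = Reverse[Sort[Counts[kept]]]
```

I am not assuming that numerals such as "ii" or "2" are on the built-in stopword list, so those would probably still need an explicit filter like the one in the original post.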

Thanks,

Marco

POSTED BY: Marco Thiel

There is, of course, a lot more you can do. For example we can use the following website:

http://thegamesdb.net

This allows us to crosscheck the data we have looked at before. So if we take the names list from before:

data = Import["http://pastebin.com/DG1CsVXk", "Data"];
Quiet[names = (StringSplit[#, "("] & /@ data[[2, 2, 3 ;;]][[1 ;;]])[[All, 1]]];

We can use:

smalldataset = 
 Quiet[{"id" -> 
      Flatten[StringSplit[StringSplit[#, "<id>"], "</id>"]][[1]], 
     "GameTitle" -> 
      Flatten[StringSplit[StringSplit[#, "<GameTitle>"], 
         "</GameTitle>"]][[2]], 
     If[StringContainsQ[
       Flatten[StringSplit[StringSplit[#, "<ReleaseDate>"], 
          "</ReleaseDate>"]][[2]], "Platform"], 
      "ReleaseDate" -> "Missing", 
      "ReleaseDate" -> 
       Interpreter["Date"][ 
        Flatten[StringSplit[StringSplit[#, "<ReleaseDate>"], 
           "</ReleaseDate>"]][[2]]]]} & /@ ((StringSplit[#, 
         "<Game>\n"] & @(Import[
           "http://thegamesdb.net/api/GetGamesList.php?name=" <> #] \
&@ RandomChoice[names]))[[2 ;;]])]

This makes a nice list of rules. Note that your database is much larger, so many queries on http://thegamesdb.net will give empty sets or, worse, errors. Anyway, we can then use fancy things like

TimelinePlot[Association["GameTitle" -> "ReleaseDate" /. smalldataset]]

To obtain

[Image: timeline plot of game titles by release date]

This command gives 100 games:

smalldataset = 
 Quiet[{"id" -> 
      Flatten[StringSplit[StringSplit[#, "<id>"], "</id>"]][[1]], 
     "GameTitle" -> 
      Flatten[StringSplit[StringSplit[#, "<GameTitle>"], 
         "</GameTitle>"]][[2]], 
     If[StringContainsQ[
       Flatten[StringSplit[StringSplit[#, "<ReleaseDate>"], 
          "</ReleaseDate>"]][[2]], "Platform"], 
      "ReleaseDate" -> "Missing", 
      "ReleaseDate" -> 
       Interpreter["Date"][ 
        Flatten[StringSplit[StringSplit[#, "<ReleaseDate>"], 
           "</ReleaseDate>"]][[2]]]]} & /@ ((StringSplit[#, 
         "<Game>\n"] & @(Import[
           "http://thegamesdb.net/api/GetGamesList.php?name=" <> #] & /@ Import[
          "http://thegamesdb.net/api/GetGamesList.php?platform=PC"]))[[2 ;;]])]

We can again plot the TimelinePlot:

TimelinePlot[Association[Select["GameTitle" -> "ReleaseDate" /. smalldataset, DateObjectQ[#[[2]]] &]]]

It is much nicer when it is interactive in the notebook, but it looks like this:

[Image: timeline plot of PC game titles by release date]

That shows quite nicely how much the market has grown. It also suggests clusters of release dates.

It is very easy to make a nice, orderly dataset out of this:

Dataset[Association /@ smalldataset]

[Image: Dataset view with id, GameTitle, and ReleaseDate columns]
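
The string splitting above is admittedly crude. A slightly more robust sketch parses the response as XML and picks out the elements by name. I am assuming the response consists of Game elements containing id, GameTitle, and ReleaseDate children, as the tags used above suggest; the root element name Data and the sample values are made up, and the inline string stands in for the imported response.

```wolfram
(* Hypothetical response; the second game has no release date *)
xmlString = "<Data><Game><id>1</id><GameTitle>Example Quest</GameTitle>" <>
   "<ReleaseDate>01/02/2004</ReleaseDate></Game>" <>
   "<Game><id>2</id><GameTitle>Example Quest II</GameTitle></Game></Data>";

xml = ImportString[xmlString, "XML"];

(* One association per Game element, keyed by child tag name *)
games = Cases[xml,
   XMLElement["Game", _, children_] :> Association[
     Cases[children, XMLElement[tag_, _, {value_String}] :> (tag -> value)]],
   Infinity];

Dataset[games]
```

A missing ReleaseDate then simply shows up as an absent key, so no special test for stray text is needed; the date strings could still be passed through Interpreter["Date"] afterwards.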

There is certainly lots more to discover here.

Cheers,

Marco

POSTED BY: Marco Thiel

Oh, one more thing. Motivated by a post on the Wolfram Blog by Matthias Odisio, you can, of course, also work with the covers of the video games. You can download a zip file with the covers from

http://www.gametdb.com/download.php?FTP=GameTDB-wii_cover-EN-2015-07-22.zip

You can then unzip the file and, after adjusting your file path, run:

covers = Import["~/Desktop/cover/EN/*", "PNG"];

There are more than 3000 covers in that dataset. Running the following on all of them takes too long, so I only use 40 random covers to illustrate the idea, i.e. the same code that Matthias Odisio used:

covers2 = RandomChoice[covers, 40];
imagedistances = ConstantArray[0., {Length[covers2], Length[covers2]}];
Monitor[Do[d = ImageDistance[covers2[[i]], covers2[[j]], DistanceFunction -> "EarthMoverDistance"];
imagedistances[[i, j]] = imagedistances[[j, i]] = d,{i, 1, Length[covers2] - 1}, {j, i + 1, Length[covers2]}];, {i, j}];
allimagedistances = Flatten[Table[Diagonal[imagedistances, k], {k, 1, Length[covers2] - 1}]];

He then plotted everything like so:

thr = FindThreshold[allimagedistances, Method -> {"BlackFraction", .05}];
adjmatrix = 1 - Unitize[Threshold[imagedistances, thr]] - IdentityMatrix[Length[covers2]];
GraphPlot[adjmatrix, VertexRenderingFunction -> (Inset[covers2[[#2]], #, Center, .5] &), Method -> "SpringEmbedding", ImageSize -> Full]

[Image: graph of game covers linked by image similarity]

None of this is my idea; all of the credit goes to Matthias Odisio. I only post it here, because the idea seems to fit nicely.

Cheers,

Marco

POSTED BY: Marco Thiel

Well done. Thanks! Do you mind if I post this on Gamasutra.com (where this article is cross-posted) as well?

With attribution of course.

POSTED BY: Rob Lockhart

Sure, no problem at all.

If you could provide your table with all the words in each year, or your initial data file as an attachment, we could make a very nice BubbleChart diagram over different years. I scraped the data in a very crude way, because the website did not load properly in my browser.

You might also be able to use Google Trends to cross-check the popularity of these games. Also, many of these games are described in great detail on Wikipedia, which in Mathematica 10.2 is easy to access from within the Wolfram Language. You have all the titles of the games and could scrape useful information from Wikipedia. With that, some really cool diagrams should be possible.

Cheers, Marco

POSTED BY: Marco Thiel

You earned the "Featured Contributor" badge, congratulations!

This is a great post and it has been selected for the curated Staff Picks group. Your profile is now distinguished by a "Featured Contributor" badge and displayed on the "Featured Contributor" board.

POSTED BY: EDITORIAL BOARD