Hi, I'm Rob Lockhart, Creative Director of Important Little Games. I'd be grateful if you followed me on Twitter.
It all started when I stumbled across this misleadingly titled Polygon article written last year and followed the link to the data source out of curiosity. Basically, it's just a list of videogame titles, some of which have been annotated with a developer, a year, and/or a platform. Since I'm fond of semi-structured data sources, I downloaded the list, which had grown to nearly 150,000 titles since the Polygon article was published, and started to play around in Wolfram Language. As you read on, be advised that this is an extremely noisy dataset: it does not necessarily reflect the history of the videogame industry, and even the individual entries may not be accurate. Here is the import and initial cleaning:
gameliststring = Import["~/Documents/list_of_every_video_game_ever_(v3).txt"];
gamelist = (StringTrim[#,
RegularExpression["[\\s\\(\\)]*"]] & /@ (StringSplit[#,
RegularExpression["\\)?\\s*\\("]] &)) /@
StringSplit[gameliststring, "\n"][[4 ;;]];
Length[gamelist]
149665
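To make the cleaning step easier to follow, here is a small sketch that applies the same two regular expressions to made-up lines. The sample strings and the name parseLine are hypothetical, and the layout of a title followed by parenthesized tags is an assumption inferred from the patterns above, not something taken from the file itself.

```wolfram
(* Hypothetical sample lines; the real file's layout is assumed, not verified *)
sample = {"Some Game (Some Studio) (2004) (PS2)", "Another Game"};

(* Split a line on each opening parenthesis (plus any preceding ")" and spaces),
   then trim leftover whitespace and parentheses from every piece *)
parseLine[line_String] :=
 StringTrim[#, RegularExpression["[\\s\\(\\)]*"]] & /@
  StringSplit[line, RegularExpression["\\)?\\s*\\("]]

parseLine /@ sample
```

Each entry becomes a list whose first element is the title and whose remaining elements are whatever tags the line carried, which is why later steps refer to gamelist[[All, 1]] for titles.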
The first thing I did was take a look at the top words that occur in videogame titles. The roughly 150,000 game titles draw on a vocabulary of around 45,000 unique words, and nearly 22,000 of these are used only once across all titles. For scale, consider that apparently it is not uncommon for a native speaker to have 20,000-35,000 words in their whole vocabulary.
Let's take a look at the top 50 words I found:
titlewords = SortBy[Tally[Flatten[StringSplit[ToLowerCase[#],
RegularExpression@"[^\\w\\']+"] & /@ gamelist[[All, 1]]]], 1/#[[2]] &];
BarChart[titlewords[[;; 50, 2]],
ChartLabels -> Placed[titlewords[[;; 50, 1]], {{0.5, 0}, {0.9, 1}},
Rotate[#, (2/7) Pi] &], ChartStyle -> 24, ImageSize -> {800, 500}]
There are a lot of words that are completely unsurprising, as they are overwhelmingly frequent throughout English. Numerals, both Arabic and Roman, play a big role, which suggests that there are a lot of sequels. That's frustrating for those of us who value originality in interactive entertainment, but by no means surprising. Let's filter out these uninteresting results and look again:
nontrivial =
SortBy[ReplaceRepeated[
DeleteCases[
titlewords, {"vol" | "the" | "and" | "a" | "of" | "in" | "no" |
"to" | "for" | "is" | "1" | "10" | "13" |
Alternatives @@ (ToString /@ Range[2, 5]) | "ii" | "iii" |
Alternatives @@ (ToString /@ Range[2000, 2016]) |
Alternatives @@ ("0" <> ToString[#] & /@
Range[5, 9]), _}], {a___, {b_, c_}, d___, {e_, f_}, g___} /;
StringMatchQ[e, b ~~ "s" | "es"] :> {a, {b, c + f}, d, g},
MaxIterations -> 25], 1/#[[2]] &];
BarChart[nontrivial[[;; 50, 2]],
ChartLabels ->
Placed[nontrivial[[;; 50, 1]], {{0.5, 0}, {0.9, 1}},
Rotate[#, (2/7) Pi] &], ChartStyle -> 2, ImageSize -> {800, 500}]
Length[Cases[titlewords, {_, 1}]]
21797
Length[titlewords]
44577
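I also recombined simple plurals (a trailing 's' or 'es') into the root word. The rule only folds a plural into a singular that ranks above it, and MaxIterations -> 25 caps it at 25 merges, so rarer plurals remain separate entries.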
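The pattern-based merge above rescans the whole list for every replacement, which gets slow on a 45,000-word vocabulary. As a rough alternative, here is a minimal sketch that does the same kind of plural folding with an Association lookup. The name mergePlurals and the sample tally are hypothetical, and unlike the original rule it merges every matching pair regardless of rank, so its counts can differ from the chart above.

```wolfram
(* Hypothetical helper: fold "word" <> "s" or "es" counts into "word" when the root exists *)
mergePlurals[tally_List] :=
 Module[{counts = Association[Rule @@@ tally], root},
  Do[
   (* Pick the root this word would fold into, if any *)
   root = Which[
     StringLength[word] > 2 && StringEndsQ[word, "es"] &&
      KeyExistsQ[counts, StringDrop[word, -2]], StringDrop[word, -2],
     StringLength[word] > 1 && StringEndsQ[word, "s"] &&
      KeyExistsQ[counts, StringDrop[word, -1]], StringDrop[word, -1],
     True, None];
   If[root =!= None,
    counts[root] += counts[word];
    KeyDropFrom[counts, word]],
   {word, Keys[counts]}];
  (* Back to {word, count} pairs, most frequent first *)
  SortBy[List @@@ Normal[counts], -Last[#] &]]

(* Made-up tally for illustration *)
mergePlurals[{{"game", 10}, {"war", 7}, {"games", 4}, {"wars", 2}, {"box", 3}, {"boxes", 1}}]
```

On the real data this would be called as mergePlurals[titlewords] after the stop words have been removed.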
In my humble opinion, it really sucks that 'war' shows up second, after 'game.' There's nothing wrong with war as a theme for any particular game, but our industry's singular focus on war and violence becomes pretty tiresome, as this chart exemplifies. Which word would I prefer in second place? 'Magic,' of course!
I also noticed that there were quite a lot of games which use subtitles. Not the written dialogue at the bottom of the cutscenes, but the second part of a title separated by a colon. Things like the part after the colon in "Call of Warfare: Modern Videogame." Let's take a look at the most common subtitles:
subtitles =
SortBy[Tally[((StringSplit[#,
": "] /. {a_} :> {"No Subtitle"})[[-1]] & /@
gamelist[[All, 1]])], 1/#[[2]] &][[;; 500]];
BarChart[subtitles[[2 ;; 50, 2]],
ChartLabels ->
Placed[subtitles[[2 ;; 50, 1]], {{0.5, 0}, {0.9, 1}},
Rotate[#, (2/7) Pi] &] , ChartStyle -> "Rainbow"] /.
ImageScaled[{1/2, 1}] -> ImageScaled[{0.9, 1}]
'The Game' and 'Gold Edition' seem to make sense, but for some reason 'The Movie' comes in third. Why are there so many games (56) with ': The Movie' in the title?!
I'm not very fond of this naming pattern in the first place, but some of these should unquestionably be retired. Let's not name any more games "Something Something: Vengeance," shall we?
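If you want to see which titles are behind one of those bars, a small filter is enough. This is only a sketch: titlesWithSubtitle is a hypothetical name, the sample titles are made up, and it treats everything after the last ": " as the subtitle, in line with the tally above.

```wolfram
(* Hypothetical helper: keep titles whose final ": "-separated part equals sub *)
titlesWithSubtitle[titles_List, sub_String] :=
 Select[titles, StringEndsQ[#, ": " <> sub] &]

(* Made-up titles for illustration *)
titlesWithSubtitle[
 {"Alpha: The Movie", "Beta", "Gamma: Part Two: The Movie", "The Movie"},
 "The Movie"]
```

Against the real list, titlesWithSubtitle[gamelist[[All, 1]], "The Movie"] should return the titles counted in the 'The Movie' bar.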
As I mentioned earlier, some of the entries in the data are tagged with a developer, year, and/or platform. I found the developers more or less impossible to extract systematically, but I had better luck with years and platforms.
About 1/5 of the games were tagged with a year, but they were represented unevenly. As you can see below, only the years from 2000 to 2015 had any kind of decent coverage, and 2015 only partially, with 375 titles. It's interesting to note that from 2000 through 2014 the number of games released per year stayed between roughly 1,700 and 2,350, without a clear increase or decrease (if this dataset can be taken as a representative sample).
withYear = #[[1, 2]] -> #[[All, 1]] & /@
GatherBy[
Cases[gamelist, {name_, ___,
a_String /;
StringMatchQ[a,
RegularExpression[
"[\\w\\s,]*(?:19[789][0-9]|20[01][0-9])[\\w\\s,]*"]], \
___} :> {name,
ToExpression[
StringCases[a,
RegularExpression@"(?:19[789][0-9])|(?:20[01][0-9])"][[
1]]]}], #[[2]] &];
N[Length[Join @@ withYear[[All, 2]]]/Length[gamelist]]
0.200889
numPerYear =
Thread[{Range[1984, 2016],
Replace[(Range[1984, 2016] /. withYear), {a_List :> Length[a],
b_?NumericQ :> 0}, 1]}]
{{1984, 1}, {1985, 0}, {1986, 0}, {1987, 0}, {1988, 1}, {1989, 0}, {1990, 2}, {1991, 3}, {1992, 0}, {1993, 0}, {1994, 1}, {1995, 0}, {1996, 1}, {1997, 0}, {1998, 1}, {1999, 1}, {2000, 2161}, {2001, 1964}, {2002, 2002}, {2003, 1827}, {2004, 1687}, {2005, 1885}, {2006, 1821}, {2007, 1854}, {2008, 1860}, {2009, 2351}, {2010, 2160}, {2011, 1982}, {2012, 2308}, {2013, 2075}, {2014, 1740}, {2015, 375}, {2016, 3}}
BarChart[numPerYear[[All, 2]],
ChartLabels ->
Placed[numPerYear[[All, 1]], {{0.5, 0}, {0.9, 1}},
Rotate[#, (2/7) Pi] &]]
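To put a rough number on the "no clear trend" impression, here is a quick sketch using the 2000-2014 counts copied from the numPerYear output above. The variable names are my own, and whether the fitted slope counts as "significant" is a judgment call that this sketch does not settle.

```wolfram
(* Counts for 2000-2014, copied from the numPerYear output above *)
counts = {2161, 1964, 2002, 1827, 1687, 1885, 1821, 1854, 1860, 2351,
   2160, 1982, 2308, 2075, 1740};
years = Range[2000, 2014];

meanCount = N[Mean[counts]]            (* average titles per year, about 1978 *)
spread = N[StandardDeviation[counts]]  (* year-to-year variation *)

(* Least-squares line through the yearly counts; the coefficient of x is the trend per year *)
trend = Fit[Transpose[{years, counts}], {1, x}, x];
slope = Coefficient[trend, x]
```

Comparing slope with spread gives a feel for how small any drift is relative to the ordinary year-to-year noise.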
If we compile a list of the top ten words for each of these usable years, we might notice some trends.
Grid[Prepend[
SortBy[Prepend[
With[{sorted =
DeleteCases[
SortBy[Tally[
Flatten[
StringSplit[ToLowerCase[#],
RegularExpression@"[^\\w\\']+"] & /@ #[[2]]]],
1/#[[2]] &], {"vol" | "the" | "and" | "a" | "of" | "in" |
"no" | "to" | "for" | "is" | "1" | "10" | "13" |
Alternatives @@ (ToString /@ Range[2, 5]) | "ii" |
"iii" | Alternatives @@ (ToString /@ Range[2000, 2016]) |
Alternatives @@ ("0" <> ToString[#] & /@
Range[5, 9]), _}]},
If[Length[sorted] >= 10, sorted[[;; 10]][[All, 1]],
sorted[[All, 1]]]], #[[1]]] & /@ withYear, #[[1]] &][[9 ;; -2]],
{"Year", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}]]
I think you can kind of see the zombie craze creeping up in the past few years, as the words 'dark,' 'night,' and 'dead' climb the charts. You can also see where we became obsessed with 3D for a little while.
If we bring back the trivial words we decided to exclude early on, you'll see that some games' titles include the year they were released and many include the following year.
Grid[Prepend[
SortBy[Prepend[
With[{sorted =
DeleteCases[
SortBy[Tally[
Flatten[
StringSplit[ToLowerCase[#],
RegularExpression@"[^\\w\\']+"] & /@ #[[2]]]],
1/#[[2]] &], {"the" | "and" | "a" | "of" | "in" | "no" |
"to" | "for" | "is", _}]},
If[Length[sorted] >= 10, sorted[[;; 10]][[All, 1]],
sorted[[All, 1]]]], #[[1]]] & /@ withYear, #[[1]] &][[9 ;; -2]],
{"Year", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}]]
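The two tables above, and the platform table further down, repeat the same split, filter, and rank steps. As a sketch of how that could be factored out, here is a small helper. The names topWords and stopWords are hypothetical, the stop list shown is just the short one from the second table, and UpTo needs a reasonably recent version of the language (10.3 or later, as far as I know).

```wolfram
(* Hypothetical stop list; swap in the longer filter if you want numerals removed too *)
stopWords = {"the", "and", "a", "of", "in", "no", "to", "for", "is"};

(* Hypothetical helper: the n most frequent words in a list of titles, skipping stop words *)
topWords[titles_List, stop_List, n_Integer: 10] :=
 Module[{words, tally},
  (* Lowercase each title and split it on anything that is not a word character or apostrophe *)
  words = Flatten[
    StringSplit[ToLowerCase[#], RegularExpression["[^\\w']+"]] & /@ titles];
  (* Count the remaining words and sort by descending frequency *)
  tally = SortBy[Tally[DeleteCases[words, Alternatives @@ stop]], -Last[#] &];
  Take[First /@ tally, UpTo[n]]]

(* Made-up titles for illustration *)
topWords[{"The War of Wars", "War Game", "Game of War"}, stopWords, 2]
```

With the real data, something like {#[[1]], topWords[#[[2]], stopWords]} & /@ withYear should give one row per year.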
In terms of platforms, the coverage was very spotty. Here you can see the number of games tagged by platform. The fact that Linux has any significant presence should be a clue that some platforms are far overrepresented amongst tagged games.
platforms = {"Windows", "PS2", "PC-9801", "Linux/Unix", "PS1", "PS3", "PSP", "Mobile", "Arcade", "NES", "Apple IIe", "C64", "X360",
"Amstrad CPC", "ZX Spectrum", "Xbox", "GameCube", "Wii U", "ZX Spectrum 128", "MSX", "TI99", "SG-1000", "GBC", "N64", "iOS",
"FM7", "OS/2", "PS Vita", "Amiga", "Nintendo DS", "Wii", "Mac", "EXL 100", "Android", "Thomson", "3DS", "Dreamcast", "Atari ST",
"MS-DOS", "GBA", "C128", "Oric", "Chip 8", "PC-8801", "PCE CD/TG-CD", "Mega Drive / Genesis", "VIC-20", "X68000",
"Sharp X1", "Atari 8-bit", "Amiga AGA", "Famicom Disk System", "PCE / TurboGrafx", "MSX2", "GP2X", "BBC", "SNES", "Flash", "SMS",
"GameKing", "ColecoVision", "Neo Geo", "Apple IIGS", "HP-41", "Astrocade", "C16/Plus4", "Saturn", "BeOS", "Game Boy",
"Sega-CD / Mega-CD", "PC-6001", "Atari 2600", "Arcadia 2001", "Win3.1", "Epoc", "Atom", "3DO", "Mattel Aquarius", "Electron",
"ZX 81", "Dragon32", "Zeebo", "QL", "Enterprise", "Archimedes", "Virtual Boy", "TRS-80", "MicroBee", "Sharp MZ-700", "CD-i",
"FM Towns", "Game Gear", "Lynx", "Internet Only", "Odyssey\.b2", "Tandy", "Pico", "Atari 5200", "BK11M", "Intellivision",
"Atari 7800", "DEC PDP-1", "TI Calculators", "32X", "Jaguar", "PC-FX", "Pippin", "PET", "Odyssey\.b3", "Creativision", "PLATO",
"WinCE", "DVD player", "Vii", "V.Smile", "WSC", "Supervision", "N-Gage", "Xbox One", "PS4", "GP32", "Vectrex", "VG5000\[Micro]",
"SAM", "Atari Falcon", "custom", "Odyssey", "WS", "Mega LD", "R-Zone", "Cassette Vision", "Konix", "Game & Watch", "Sord M5",
"NGPC", "Tiger Game.COM", "Palm", "Game Master", "VC 4000", "Leapster", "Fairchild Ch. F", "Collector's Edition",
"Special Edition", "Limited Edition", "Gold Edition", "included games", "Shokai Genteiban", "Deluxe Edition",
"Limited Collector's Edition", "Genteiban", "Series 2"};
gamesByPlatform = #[[1, -1]] -> #[[All, ;; -2]] & /@
GatherBy[
Cases[gamelist, {__, Alternatives @@ platforms}], #[[-1]] &];
Length[Join @@ gamesByPlatform[[All, 2]]]
100455
platformTally =
SortBy[gamesByPlatform /. {HoldPattern[
plat_String -> games_List] :> {plat, Length[games]}},
1/#[[2]] &];
BarChart[platformTally[[;; 25, 2]],
ChartLabels ->
Placed[platformTally[[;; 25, 1]], {{0.5, 0}, {0.9, 1}},
Rotate[#, (2/7) Pi] &], ChartStyle -> "Pastel", AspectRatio -> 1/5]
If you're interested, here is a list of the top ten words by platform. Many of these platforms only have one or two titles listed, so you'll see some oddly specific words.
Export["~/Documents/RecreationalProgramming/GamesList/PlatformWords.jpg", Grid[
DeleteCases[
Prepend[SortBy[
Prepend[With[{sorted =
DeleteCases[
SortBy[Tally[
Flatten[
StringSplit[ToLowerCase[#],
RegularExpression@"[^\\w\\']+"] & /@ #[[2]]]],
1/#[[2]] &], {"vol" | "the" | "and" | "a" | "of" |
"on" | "in" | "no" | "to" | "for" | "is" | "1" | "10" |
"13" | Alternatives @@ (ToString /@ Range[2, 5]) |
"ii" | "iii" |
Alternatives @@ (ToString /@ Range[2000, 2016]) |
Alternatives @@ ("0" <> ToString[#] & /@
Range[5, 9]), _}]},
If[Length[sorted] >= 10, sorted[[;; 10]][[All, 1]],
sorted[[All, 1]]]], #[[1]]] & /@
Thread[gamesByPlatform[[All, 1]] ->
gamesByPlatform[[All, 2, All, 1]]], #[[1]] &], {"Platform", 1, 2, 3, 4, 5, 6, 7, 8, 9,
10}], {"Limited Collector's Edition" | "Collector's Edition" |
"Gold Edition" | "included games", __}]]]
Thanks for reading! If you're interested in exploring the dataset yourself, feel free to download my Mathematica notebook, which is attached to this post below. I'd love to hear your suggestions for further analyses and other datasets to explore.
Attachments: