Message Boards Message Boards

2018 FIFA World Cup Football / Soccer Simulation

winner map

This June 14, 2018 starts the FIFA World Cup in Russia. After reading El País predition (see (1) and (2)), I wondered if I could build my own simulation using Wolfram Language. So here it is what I tried. Any new ideas on how to improve it are welcome, and let the best win!

Introduction

I will build this simulation by blocks:

  1. Base data
  2. Simulation of a match
  3. Simulation of the groups phase
  4. Simulation of the second phase
  5. All together

1. Base data

1.1. Rankings

As a base for this model, I will use the FIFA ranking. We could also do a test using other rankings (like the one generated by El País (2)).

I've downloaded the ranking from FIFA, and after using Interpreter["Country"] on all the names, I could get most of the countries properly parsed. Of course, we all know UK is not united for football, so I used Interpreter["AdministrativeDivision"] for the UK home nations, and a few other cases needed special threatment (like the case of Tahiti, which is not a proper country, so I did align it with Tahiti Islands, or Congo vs Congo DR, which need special disambiguation). Here I will attach the result of that ranking, so we should just load it:

rankingPoints = Dispatch[Get["rankingFifaJune7.m"]]

1.2. Participants and Groups

Using the previous alignment, I took the list from FIFA and converted into a simple list of countries.

countries = {Entity["Country", "Russia"], 
  Entity["Country", "SaudiArabia"], Entity["Country", "Egypt"], 
  Entity["Country", "Uruguay"], Entity["Country", "Portugal"], 
  Entity["Country", "Spain"], Entity["Country", "Morocco"], 
  Entity["Country", "Iran"], Entity["Country", "France"], 
  Entity["Country", "Australia"], Entity["Country", "Peru"], 
  Entity["Country", "Denmark"], Entity["Country", "Argentina"], 
  Entity["Country", "Iceland"], Entity["Country", "Croatia"], 
  Entity["Country", "Nigeria"], Entity["Country", "Brazil"], 
  Entity["Country", "Switzerland"], Entity["Country", "CostaRica"], 
  Entity["Country", "Serbia"], Entity["Country", "Germany"], 
  Entity["Country", "Mexico"], Entity["Country", "Sweden"], 
  Entity["Country", "SouthKorea"], Entity["Country", "Belgium"], 
  Entity["Country", "Panama"], Entity["Country", "Tunisia"], 
  Entity["AdministrativeDivision", {"England", "UnitedKingdom"}], 
  Entity["Country", "Poland"], Entity["Country", "Senegal"], 
  Entity["Country", "Colombia"], Entity["Country", "Japan"]}

Given that it is sorted by groups, we could just use Partition and get the teams per group:

teamsByGroup = AssociationThread[CharacterRange["A", "H"] -> Partition[countries, 4]]

2. Simulation of a match

This is probably the hardest part of this problem. How to predict the outcome of a match? If it were easy, we would not even play! Well, I will use the same prediction used by other sources (like this paper), which is based on a Poisson distribution using the ration between the FIFA rankings. I'm adding also another factor that could be used to tweak more the probability (for example, Russia may not have a good ranking, but it is playing at home, that may be important!).

When it is not allowed to be a draw, instead of doing only an extra time and round of penalties if needed, I just called again the basic results to get a new result until no the tie is broken. To get an actual result we may want to call only once the Poisson distribution (for the extra time), and then maybe a binomial distribution for penalties.

getPossibleResult[weight1_, weight2_, lambda1_, lambda2_] := 
    {RandomVariate[PoissonDistribution[lambda1 weight1/weight2]], RandomVariate[PoissonDistribution[lambda2 weight2/weight1]]};

simulateMatch[weight1_, weight2_, lambda1_, lambda2_, drawQ_] := Module[{result}, 
    result = getPossibleResult[weight1, weight2, lambda1, lambda2];
    If[!drawQ,
        While[SameQ @@ result,
            result = getPossibleResult[weight1, weight2, lambda1, lambda2]
        ];
    ];
    result
];

simulatedMatchResults[match["Groups"][team1_, team2_]] := simulateMatch[team1 /. rankingPoints, team2 /. rankingPoints, 1, 1, True];
simulatedMatchResults[match["Knockout"][team1_, team2_]] := simulateMatch[team1 /. rankingPoints, team2 /. rankingPoints, 1, 1, False];

In this first model I'm using 1 for the lambda values, but we could generate another dispatch list to modify for example Russia factor for playing as local, or adding more strengh to our favorite team, etc. Of course, if you change too much the values, remember not to use the result for doing any bet!

3. Simulation of the groups phase

To simulate a each group, we will play each match, and then we will compute the full table of results sorted by points, goals difference, goals done, and goals against.

simulateGroup[groupsTeam_] := 
Module[{matches = DeleteDuplicatesBy[Permutations[groupsTeam, {2}], Sort], matchResults, resultsTable, resultsLookup},
    resultsLookup[_, _] = 0;
    matchResults = {##, simulatedMatchResults[match["Groups"][##]]} & @@@ matches;
    Function[{team1, team2, result},
        Switch[result,
            {a_, b_} /; a > b,
                resultsLookup[team1, "WINS"] = resultsLookup[team1, "WINS"] + 1;
                resultsLookup[team2, "LOSTS"] = resultsLookup[team2, "LOSTS"] + 1;
                resultsLookup[team1, "POINTS"] = resultsLookup[team1, "POINTS"] + 3;
            ,
            {a_, b_} /; a < b,
                resultsLookup[team2, "WINS"] = resultsLookup[team2, "WINS"] + 1;
                resultsLookup[team1, "LOSTS"] = resultsLookup[team1, "LOSTS"] + 1;
                resultsLookup[team2, "POINTS"] = resultsLookup[team2, "POINTS"] + 3;
            ,
            _
            ,
                resultsLookup[team1, "DRAWS"] = resultsLookup[team1, "DRAWS"] + 1;
                resultsLookup[team2, "DRAWS"] = resultsLookup[team2, "DRAWS"] + 1;
                resultsLookup[team1, "POINTS"] = resultsLookup[team1, "POINTS"] + 1;
                resultsLookup[team2, "POINTS"] =  resultsLookup[team2, "POINTS"] + 1;
        ];
        resultsLookup[team1, "GOL+"] = resultsLookup[team1, "GOL+"] + result[[1]];
        resultsLookup[team2, "GOL+"] = resultsLookup[team2, "GOL+"] + result[[2]];
        resultsLookup[team1, "GOL-"] = resultsLookup[team1, "GOL-"] + result[[2]];
        resultsLookup[team2, "GOL-"] = resultsLookup[team2, "GOL-"] + result[[1]];
    ] @@@ matchResults;
    resultsTable = Table[
        {
            t,
            resultsLookup[t, "POINTS"],
            resultsLookup[t, "WINS"],
            resultsLookup[t, "LOSTS"],
            resultsLookup[t, "DRAWS"],
            resultsLookup[t, "GOL+"],
            resultsLookup[t, "GOL-"],
            resultsLookup[t, "GOL+"] - resultsLookup[t, "GOL-"]
        }, {t, groupsTeam}];
    SortBy[resultsTable, {-#[[2]], -#[[-1]], -#[[-3]], #[[-2]]} &]
];

selectClassified[groupResultTable_] := groupResultTable[[;;2, 1]];

simulateAllGroups[groups_] := simulateGroup /@ groups;

selectAllClassified[groupsResults_] := AssociationMap[List["1"<>#[[1]]->#[[2, 1]],"2"<>#[[1]]->#[[2, 2]]] &, selectClassified /@ groupsResults]

4. Simulation of the second phase

4.1. Matches schedule

First step here is to get the match schedule. Who plays against who in each knockout round? Well, that's easy to get from FIFA, and as a way to store it, I will do a list of rules of matchnumber -> {team1, team2}, where team1 and team2 will be strings with the form "1"|"2" ~~ group or "W"|"L" ~~ matchnumber.

roundOf16 = Thread[Range[49, 56] -> {{"1C", "2D"}, {"1A", "2B"}, {"1B", "2A"}, {"1D", "2C"}, {"1E", "2F"}, {"1G", "2H"}, {"1F", "2E"}, {"1H", "2G"}}];

quarters = Thread[Range[57, 60] -> {{"W49", "W50"}, {"W53", "W54"}, {"W55", "W56"}, {"W51", "W52"}}];

semifinals = {61 -> {"W57", "W58"}, 62 -> {"W59", "W60"}};

thirdplace = {63 -> {"L61", "L62"}};

final = {64 -> {"W61", "W62"}}

With all this we could define the competition as a succesive group of rounds:

competition = {roundOf16, quarters, semifinals, Join[thirdplace, final]};

4.2. Playing the phase

Playing a round means doing a match and returning the results, to make it easier, the results will accumulate, so that we can then use Fold to have the total results.

playRound[round_, previousWL_] := Join[
    previousWL,
    Association@Flatten@Map[
        Thread[
            {"L" <> ToString[#[[1]]], "W" <> ToString[#[[1]]]} -> 
            #[[2]][[Ordering[simulatedMatchResults[match["Knockout"] @@ #[[2]]]]]]
        ] &,
        round /. previousWL
    ]
];

playRounds[groupsResults_] := Fold[playRound[#2, #1] &, groupsResults, competition]

4.3. First basic stats

With all the results, the first stats I could think about were in which position each team finished. I only cared about the teams that actually reached at least the second phase. Of course if we run this enough times, all the teams will get into the second phase at least once... I guess.

First, let's define a function that will return the team position depending on the match they lost (or won, for third and first place):

posByLost = Flatten[Thread /@ {
    ("L" <> ToString[#] & /@ roundOf16[[All, 1]]) -> 9,
    ("L" <> ToString[#] & /@ quarters[[All, 1]]) -> 5,
    "L63" -> 4, "W63" -> 3, "L64" -> 2, "W64" -> 1, _ -> None
}]

(posByLostFun[#1] = #2) & @@@ posByLost;

If we apply that function to the result of playRounds we can get which position each team got:

getPosition[globalResults_] := Reverse[Sort[DeleteCases[KeyValueMap[posByLostFun[#1] -> #2 &, globalResults], None -> _]], {2}]

And if we have multiple resutls of getPosition we would need to do a summary:

summaryPositions[positions_] := SortBy[
    With[{n = Length[positions]},
        GroupBy[Join @@ positions, First -> Last, Sort[(#1 -> N[#2/n] & @@@ Tally[#])] &]
    ],
    -Accumulate[Lookup[#, {1, 2, 3, 4, 5, 9}, 0]] &
];

But, given that it is more common to talk about the winner, finalist, semifinalist, etc, let's make a function to compute that exactly, using as input the previous result:

accumulatePositions[pos_] := Association@
    ReleaseHold[
        {
            "Winner" -> Hold[1],
            "Finalist" -> Hold[1] + Hold[2], 
            "Semifinalist" -> Hold[1] + Hold[2] + Hold[3] + Hold[4], 
            "Quarters" -> Hold[1] + Hold[2] + Hold[3] + Hold[4] + Hold[5], 
            "Round16" -> Hold[1] + Hold[2] + Hold[3] + Hold[4] + Hold[5] + Hold[9]
        } /. Append[pos, _Integer -> 0]
    ];

4.4. Other ideas for more stats

Some very custom stats could be done too, like for example, getting the most probably final, or the against which teams will play your favorite team.

5. All together

Now we have all the tools needed to run the world cup n times:

simulateWC[n_] := accumulatedPositions[
    summaryPositions[
        Table[
            getPosition[
                playRounds[
                    selectAllClassified[
                        simulateAllGroups[
                            teamsByGroup
                        ]
                    ]
                ]
            ],
            n
        ]
    ]
];

And we need to have a nice display of it:

displaySimulationResult[sr_] := (Dataset`$DatasetTargetRowCount = 40; Dataset[sr /. x_?NumericQ :> Quantity[100 x, "Percent"]])

(I know, and I always say it, Dataset is for much more than just displaying tables, but they look great).

5.1. Using it!

Now we can play with our functions, and do simulations of 10000 or 100000 or even more if you have memory and time, and get the probabilities by just counting.

Some results:

simulation100k = simulateWC[100000];

(This took 473 seconds in my computer, i7-4710HQ ).

And now we can display them:

displaySimulationResult[simulation100k]

display simulation 100k

5.1.1. Why not a map too?

Let's make a nice map of the winners probability and about reaching the second phase:

Labeled[GeoRegionValuePlot[Quantity[100 #["Winner"], "Percent"] & /@ simulation100k, 
    GeoBackground -> "CountryBorders", GeoProjection -> "Robinson", ImageSize -> 1200, 
    PlotLabel -> Style["Probability of winning the 2018 FIFA World Cup Russia", 24]],
    Text["Using 100000 simulations based on FIFA ranking from June 7, 2018"]
]

winner map

Labeled[GeoRegionValuePlot[Quantity[100 #["Round16"], "Percent"] & /@ simulation100k, 
    GeoBackground -> "CountryBorders", GeoProjection -> "Robinson", ImageSize -> 1200, 
    PlotLabel -> Style["Probability of getting into second phase in the 2018 FIFA World Cup Russia", 24]], 
    Text["Using 100000 simulations based on FIFA ranking from June 7, 2018"]
]

second phase map

5.2. Desired final!

What about computing the probability of your desired final? It shouldn't be that hard given all this, it would be a matter of getting the joint probability of your teams being in first or second position simultaneously. Let's say we want a Peru-Brazil final -- and of course, expecting that Peru wins :).

We could try simply 10000 times this time:

Block[{n = 10000, c = 0},
    Do[
        c += Boole[MatchQ[Sort[{"W64", "L64"} /. playRounds[selectAllClassified[simulateAllGroups[teamsByGroup]]]], {Entity["Country", "Brazil"], Entity["Country", "Peru"]}]];
        , n
    ];
    Quantity[100.0 c/n, "Percent"]
]

And after doing that I got 0.85%, not as good as we would like.

Conclusions

  1. This is going to be fun!
  2. This simulation is based on a given ranking, but it is easy to change it, according to different expectations and factors.
  3. As a pending task remains the addition of a meaningful extra factor for each team (or even eath team pair).
Attachments:
11 Replies
Posted 6 years ago

Very interesting. The top 16 teams with better prob to make the second phase were almost the same that actually advanced. The only issue I see is using the FIFA ranking as it is not a very good way of actually ranking the teams. Great job.

POSTED BY: Anne Summers

Germany, Brazil and Portugal lost. In the forecast these teams occupy the first three places. The analysis is done correctly, but the result is erroneous.

Posted 6 years ago

From what I can see most of the analysis so far is taken from "static" data, ranking, host nation advantage, size of midfielder etc.. the cup is an evolving competition where highs from a previous game can soon turn to lows e.g. germany/sweden high followed by germany/south korea low or where injuries incurred in previous games can affect whether key players can play in subsequent knockout matches e.g. Uruguay's Cavani, and therefore negatively influence the result, or whether yellow cards given in previous matches can have a similar affect on subsequent matches. I can imagine that if on-going "live" data was fed into the computation we would end up with an evermore plausible winner :-) As seen by Apps like "Tor alarm" (german for goal alarm) live data such as: team makeup, substitutions, when the game starts/ends, yellow cards, goals and by whom is available online.

POSTED BY: Kieran Ryan
Posted 6 years ago

Hi All, Great to see interest in the world cup. I have two notebooks about the world cup. One is the history of the world cup displayed on a world map. The other is the results, calendar, stats and other interesting facets of the current world cup, also displayed on the world map. I think it is very compelling because so much info is available on one screen. I would love to share them but trying to do so using Wolfram Cloud seems to be a complete disaster. They wont even load and when they do, the controls are all out of whack. I'm just a hobbyist. If one of you has some time to look at the code to see if there is any hope of sharing it on the cloud, I would be delighted for your assistance. Here is my direct email in case you want to contact me directly. A link to the code and the code itself follow below. markstevenlawton@gmail.com

here is the code: https://drive.google.com/file/d/1r7AGWaUYd7j3xT0St7p5VUxrQd-LBPzE/view?usp=sharing

Attachments:
POSTED BY: Mark Lawton

Germany lost, who is the next champion?

Great post @Francisco thank you for sharing! Perhaps we can get a more accurate prediction if we include info about each player. Fifa provides some nice info about the 700+ players that will participate in the World Cup https://www.fifa.com/worldcup/players/ . So I created the following dataset (see it attached):

datasetWC18 = Import["WC18PlayersDataset.m"]

enter image description here

Age of the players might be a relevant feature. But the most relevant info is still missing .... players market value and achievements within their clubs (there are several places where this info is available, but I haven't had time to include it).

Anyway let's do some nice visualization with this players dataset as a start. For example I was curious about the correlation between "Height" / "Weight" and their "Field Position". The results are the following (Note that I'm using ListPointPlot3D to associate a 3rd dimension to players indicating their "Field Position") :

fieldPositions = {"GK", "DF", "MF", "FW"};

ListPointPlot3D[
    Table[
            Map[Append[#, "PositionN" -> i] &, 
                    datasetWC18[Select[#Position == fieldPositions[[i]] &], {"Height", "Weight"}]
                    ]
      , {i, 4}],
 ImageSize -> 800,
 PlotStyle -> PointSize[0.012],
 PlotLegends -> SwatchLegend@{Style["Goalkeepers", 18], Style["Defenders", 18],   Style["Midfielders", 18], Style["Forwards", 18]},
 AxesLabel -> {Style["cm", 18, Bold], Style["kg", 18, Bold]},
 PlotLabel -> Style["Players height vs weight", 20, Bold]]

enter image description here

As expected the Goalkeepers on average are the tallest and heaviest. What I wasn't expecting was that on average the Midfielders are the smallest. Perhaps it is because Midfielders best asset is their strategic mind to orchestrate the game from the middle with clever ball passes.

Projecting the flags of World Cup countries

Borrowing Francisco's code for countries entities one can easily get all their flags, except the one from England which is missing (so, I added it manually):

countries = {Entity["Country", "Russia"], 
   Entity["Country", "SaudiArabia"], Entity["Country", "Egypt"], 
   Entity["Country", "Uruguay"], Entity["Country", "Portugal"], 
   Entity["Country", "Spain"], Entity["Country", "Morocco"], 
   Entity["Country", "Iran"], Entity["Country", "France"], 
   Entity["Country", "Australia"], Entity["Country", "Peru"], 
   Entity["Country", "Denmark"], Entity["Country", "Argentina"], 
   Entity["Country", "Iceland"], Entity["Country", "Croatia"], 
   Entity["Country", "Nigeria"], Entity["Country", "Brazil"], 
   Entity["Country", "Switzerland"], Entity["Country", "CostaRica"], 
   Entity["Country", "Serbia"], Entity["Country", "Germany"], 
   Entity["Country", "Mexico"], Entity["Country", "Sweden"], 
   Entity["Country", "SouthKorea"], Entity["Country", "Belgium"], 
   Entity["Country", "Panama"], Entity["Country", "Tunisia"], 
   Entity["AdministrativeDivision", {"England", "UnitedKingdom"}], 
   Entity["Country", "Poland"], Entity["Country", "Senegal"], 
   Entity["Country", "Colombia"], Entity["Country", "Japan"]};

flags = EntityValue[countries, "Flag"];

enter image description here

flagsAll = (flags /. {x_Missing -> englandFlag})

enter image description here

Finally with GeoGraphics we can create an interactive map using Tooltip as follows:

GeoGraphics[
{EdgeForm[ Directive[Thin, Black]],
{GeoStyling[{"Image", #2}], Tooltip[Polygon[#1], CommonName[#1]]} & @@@ Transpose[{countries, flagsAll}]}, 
Background -> "CountryBorders", 
GeoProjection -> "Robinson", ImageSize -> 1200, 
PlotLabel ->  Style["Countries in the 2018 FIFA World Cup Russia", 24]]

enter image description here

Population of countries in the 2018 FIFA World Cup

In my opinion two other interesting features to look for in the participating countries are their Population and GDP. Which can be easily obtained using EntityValue :

valsPopulation = EntityValue[countries, "Population"];
GeoRegionValuePlot[Thread[countries -> valsPopulation], 
 GeoBackground -> "CountryBorders", GeoProjection -> "Robinson", 
 ImageSize -> 1200, PlotLabel ->  Style["Population of countries in the 2018 FIFA World Cup Russia",  24]]

enter image description here

GDP of countries in the 2018 FIFA World Cup

valsGDP = EntityValue[countries, "GDP"];

GeoRegionValuePlot[Thread[countries -> valsGDP], 
 GeoBackground -> "CountryBorders", GeoProjection -> "Robinson", 
 ImageSize -> 1200, PlotLabel -> Style["GDP of countries in the 2018 FIFA World Cup Russia", 24]]

enter image description here

Attachments:

Yes, we can do a whole set of comparisons for the countries playing the world cup.

About the players data, very nice data, but anyway it is always different how they play in their national selections than in their regular teams. And no match is decided until it is played!

Anyway I wonder if we could also classify the teams by how well they fit into that model of player features by position. Is a team with all its players closer to the avg of that plot a better team? That kind of questions could be interesting to compute.

enter image description here - Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

POSTED BY: EDITORIAL BOARD

This is great! @Francisco thank you for sharing! I wonder how your predictions compare to this recent article:

Prediction of the FIFA World Cup 2018 – A random forest approach with an emphasis on estimated team ability parameters https://arxiv.org/abs/1806.03208

which was also covered in MIT Technology Review blog: Machine learning predicts World Cup winner. There basic result is the following tree of wins:

enter image description here

POSTED BY: Vitaliy Kaurov

Very interesting link, and nice way to display the final output, I wonder if with the output of my functions we could generate also a complete set of the most possible fixture of matches.

During the past World Cup a close analysis was done by @Etienne Bernard with very neat machine learning techniques:

The similar visualization was:

enter image description here

enter image description here

POSTED BY: Vitaliy Kaurov
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract