# [WSSA16] Predicting Winning Odds of La Liga Matches

GROUPS:

## Abstract

La Liga is underway. Millions of fans are looking forward to their favorite teams and players. As every year, 20 clubs will struggle to win one of the most prestigious titles in Europe, and with the start of the league, many gamblers will also try to prove themselves worthy by betting on the matches. This project aims at developing a program which can give betting predictions on the upcoming matches of the league. It is implemented using machine learning techniques and the broad tool-set that is provided by Mathematica. The full source code of this project can be downloaded from here. To accomplish our goal we collected data to feed the algorithms of the machine learning functions Predict and Classify. We crawled data of the last 6 seasons (all matches in between 2010 and 2016) of Spanish League from a sports statistics website using Scrapy [1]. The data includes more than 2200 matches along with numerous relevant statistics. After that the data passed through multiple stages of refinement. Since for a single match there are numerous odds on different events (e.g. total goals, outcome, odds for goalscorers etc) we decided to keep it simple and chose to predict the outcome - one the most common events. We computed and added to the data several specific features, such as ratings, head to head statistics, that mostly affect the rightness of corresponding event. After adjusting the data we used built in Mathematica functions Predict and Classify to predict the events. These functions are used to give suggestions on the outcome of the match. We then tested our predictions on the actual odds of our matches. The odds were crawled from another website which contained all of the winning odds of Spanish League matches played in between 2010 to 2016. This data allowed us to simulate betting and try to estimate the profitability of our program. Finally, based on our calculations we suggest betting strategies that satisfy the most optimal risk-profit relationship.

## Data Transformation And Filtering

### Data Transformation

To start with, we imported our csv formatted match data and the ratings into Mathematica. Below is shown a sample of the match data csv file.

match_date,team_1,team_2,team_win,coef_1,coef_x,coef_2,coef_win,score 15-05-2016,Betis,Getafe,Betis,3.49,3.73,2.06,3.49,2:1 15-05-2016,Gijon,Villarreal,Gijon,1.50,4.64,6.31,1.50,2:0 15-05-2016,Rayo Vallecano,Levante,Rayo Vallecano,1.36,5.77,7.60,1.36,3:1 15-05-2016,Espanyol,Eibar,Espanyol,2.05,3.68,3.75,2.05,4:2 15-05-2016,Malaga,Las Palmas,Malaga,1.69,3.94,5.17,1.69,4:1

Then we used the function SemanticImport, which returns a Mathematica dataset. From the match data we imported 8 columns which are team_1, team_2, date, score etc. Then we did data cleaning to make it easier to work later.

oddsData = SemanticImport[
"/home/mher/Desktop/BPS_Project/Data/oddsportal_selenium/output.csv",
<|1->"Date",2->"String",3->"String",5->"String",6->"String",7->"String",8->"String",9->"String"|>
];

oddsToNumbers[a_Association]:=
If[
StringCount[a["coef_1"],"/"]== 0,
ReplacePart[a,
{
"coef_1" ->ToExpression[a["coef_1"]],
"coef_x" ->ToExpression[a["coef_x"]],
"coef_2" ->ToExpression[a["coef_2"]],
"coef_win" ->ToExpression[a["coef_win"]]
}
],
ReplacePart[a,
{
"coef_1" ->N[Divide @@ Map[FromDigits,StringSplit[a["coef_1"],"/"]] ,2] + 1.,
"coef_x" ->N[Divide @@ Map[FromDigits,StringSplit[a["coef_x"],"/"]] ,2] +  1.,
"coef_2" ->N[Divide @@ Map[FromDigits,StringSplit[a["coef_2"],"/"]] ,2] +  1.,
"coef_win" ->N[Divide @@ Map[FromDigits,StringSplit[a["coef_win"],"/"]] ,2] +  1.
}
]
]

oddsData=Map[oddsToNumbers,oddsData];
oddsData=Dataset[KeyMap[StringJoin[Capitalize[StringSplit[#,"_"]]]&,#]& /@ Normal[oddsData]];


With a simple function we extract the individual goals scored from both teams.

extractGoals[a_] :=
Append[a,
<|
"Team1Goals" -> FromDigits[StringSplit[a["Score"], ":"][[1]]],
"Team2Goals" -> FromDigits[StringSplit[a["Score"], ":"][[2]]]
|>
]

oddsDataNew = Map[extractGoals[#] &, oddsData];


Now data is ready. The code below adds a new column named GoalsTotal to our dataset to provide the total goals for every match. Afterwards with the function addRating we add the ratings for each club in the league.

dataNew = oddsDataNew[All,

Append[#,
"GoalsTotal" -> Slot["Team1Goals"] + Slot["Team2Goals"]] &];

ratingsRawForOddsData = SemanticImport[
"/home/mher/Desktop/BPS_Project/Data/oddsportal_selenium/ratings_for_odds_data_names.csv",
<|1 -> "String", 2 -> Automatic|>];

ratings = AssociationThread @@ Transpose[Values /@ Normal[ratingsRawForOddsData]];

Join[d,
Association[
"Team1Rating" -> (ratings[d["Team1"]]),
"Team2Rating" -> (ratings[d["Team2"]])
]
];



Now that we have added the ratings and total goals we can start to compute some individual features for every team and add them to our dataset. To accomplish that we first group all the matches by home and away status for every team. Afterwards we add to our dataset a very important data column - the individual total goals average per match. This feature is essential for predicting the total goals of a certain match (the prediction of total goals is not included in this post, it will be published later).

homeAndAway =
dataNew[
Association[
"Home" -> GroupBy[#Team1 &],
"Away" -> GroupBy[#Team2 &]
]
];

Join[d,
Association[
"Team1TotalAvrg" ->
(N[Mean[Join[homeAndAway["Home", d["Team1"], All, "Team1Goals"],
homeAndAway["Away", d["Team1"], All, "Team2Goals"] ]], 3]),
"Team2TotalAvrg" ->
(N[Mean[Join[homeAndAway["Away", d["Team2"], All, "Team2Goals"],
homeAndAway["Home", d["Team2"], All, "Team1Goals"]]], 3])
]
];



Another powerful parameter for predicting the total goals is the total number of goals that were scored during head to head clashes of two specific teams.

 addHeadToHeadTotalGoalsAvrg[a_] :=
Append[a,
"H2HTotalAvrg" ->
Or[And[#Team1 == a["Team1"] , #Team2 == a["Team2"]] ,
And[#Team1 == a["Team2"], #Team2 == a["Team1"]]] &][All, "GoalsTotal"]], 3])
]



One of the most influential factors that affect the outcome of a match is the home and away statuses of the teams. The reason behind is that relatively weak teams tend to play better at home stadium.

addHomeTeamStatus[a_] :=
Append[a,
"HomeTeam" -> "Team1"
]



Continuing the data transformation we added the outcomes of the matches. We give them numerical values: 1 for the win of the home team, -1 for the win of the away team and 0 for draw. This choice makes no difference in case of the classification done by the function Classify. However, it is crucial for predicting the outcome with the function Predict. Obviously, when needed, we can always remap the numerical values to the three outcome categories. Another important advantage of using numeric outcomes is that it allows to add Head to Head statistics feature which numerically shows which team has advantage over its opponent in previous matches. Eventually we are going to compare the accuracies when using categorical and numerical outcomes.

addOutcome[a_Association] :=
If[
a["Team1Goals"] == a["Team2Goals"],
Append[a, Association["Outcome" -> 0]],
If[a["Team1Goals"] > a["Team2Goals"],
Append[a, Association["Outcome" -> 1]],
Append[a, Association["Outcome" -> -1]]
]
]



### Data Filtering

In this phase of the data processing we add one more valuable feature which is Head to Head statistics. It basically illustrates the advantage of Team1 over Team2 during their previous matches. Then we filter the data based on the minimum number of total matches played between two specific teams.

possibleMatches = DeleteDuplicates[Values@Normal@outcomeAdded[All, {"Team1", "Team2"}]];

Mean[
Join[
Normal@
outcomeAdded[Select[#Team1 == t1 && #Team2 == t2 &],"Outcome"],
- Normal[outcomeAdded[Select[#Team2 == t1 && #Team1 == t2 &], "Outcome"]]
]
]

filterLowStatistics[t_?NumberQ] := filterLowStatistics[#, t] &
filterLowStatistics[list_, t_] :=
Module[
{matches, counts, toKeep},
matches = Sort /@ Lookup[list, {"Team1", "Team2"}];
counts = Replace[matches, Counts[matches], {1}];
Pick[list, toKeep]
]



Here are the first ten rows and all the added columns of our dataset.

## Building the Classifier of Outcome

Now we can initialize and train our classifier. There are 3 categories for the outcome of the match, namely 1 for Team1 victory, -1 for Team2 victory and 0 for the Draw. Here the dataset is separated into the training set and the test set. The columns which represent features that are relevant in predicting the outcome are HomeStatus, Team1Rating, Team2Rating and head to head statistics, labeled as H2H.

n = Round[0.9Length[filteredData]];
{trainingSet,testSet}=TakeDrop[filteredData[All,{"HomeTeam","Team1Rating","Team2Rating","H2H","Outcome"}],n];

c = Classify[trainingSet->"Outcome",Method->"LogisticRegression"]


Here we use the function ClassifierMeasurements to check the accuracy of our classifier.

cm = ClassifierMeasurements[c, testSet -> "Outcome"]


cm["Accuracy"]
0.657895


The classification accuracy is almost 66%. This number shows how powerful is the head to head statistics feature. Without this feature the accuracy would only be 58%. Now let's look at the confusion matrix plot which is very useful feature of ClassifierMeasurements function and allows to analyze the wrong classifications of the created model.

cm["ConfusionMatrixPlot"];


From the plot we can see that our classifier does not tend to predict Draw as an outcome.

Let's analyze the results a bit. By subtracting the outcomes of actual and predicted numerical classes we get a list named difference which contains integers from -2 to 2. If the subtraction of actual and predicted values is 0 then the classification was valid. Numbers -2, -1, 1 and 2 are the matches which were classified incorrectly.

actaulValue=testSet[All,"Outcome"] //Normal;
predictedValue=c[testSet];
difference=actaulValue - predictedValue;

N @ Normalize[Counts[Abs[difference]],Total]
<|0 -> 0.657895, 1 -> 0.192982, 2 -> 0.149123|>


Continuing with the analysis we see that more than 65% of the matches in the testSet were classified correctly. The rest of the matches were classified incorrectly but with a little difference. The matches that have value 1 in the difference list were classified wrongly with just 1 step of miss. Let us bring an example to make this a bit clearer. Let's take a match to which corresponds a value of 1 in the difference list. If our classifier predicted victory of Team1 and it missed (obviously) that means the match was drawn, because the difference of actual and predicted values is 1. Similarly, we can state that nearly 15% of the matches were totally misclassified, because the absolute value of the difference is 2.

## Building the Predictor of Outcome

At this part of the project we use a different approach to find out the outcome of the match. Now we don’t try to predict the category of outcome as in case of the function Classify, but rather a numerical outcome on an axis that goes from -1 (win), to 1 (loss), where values close to 0 are interpreted as draws. Similar to the method used in previous section we prepare the training and test sets.

n = Round[0.8 Length[filteredData]]
1822

{trainingSetFull, testSetFull} = TakeDrop[filteredData, n];
{trainingSet, testSet} = {trainingSetFull[All, {"Team1Rating", "Team2Rating", "H2H", "Outcome"}], testSetFull[All, {"Team1Rating", "Team2Rating", "H2H", "Outcome"}]};


We train the function and then we use the function PredictorMeasurements to test our prediction on some test data.

p = Predict[trainingSet -> "Outcome"]


pm = PredictorMeasurements[p, testSet -> "Outcome"]


Now let' s run the predictor on our test to see what it does predict.

Max[p[testSet]]
Min[p[testSet]]

1.26189
-0.872063


So we got numerical values for the outcomes of the matches. It gave values ranging from roughly - 0.8 to 1.2. Later on we will write a function to convert the numerical values to categories but for now let us analyze the ResidualPlot of our predictor.

pm["ResidualPlot"]


The plot above shows the difference between actual and predicted values for the outcome of the matches of the test set. Every point represents a single game. We can state that points that lie on y=0 which represent difference equal to 0 are the perfect guesses. Moreover points that lie near the 0 axis are almost perfect guesses. All other points have a certain amount of distance from 0 which basically shows how far the predictions were from the right guess. Continuing with the outcome predictor, we must convert the numerical values of the predictor into 3 categories based on the results of the residual plot. For now we only will create a very rough approximation to see how much accuracy we can get.

actual = testSet[All, "Outcome"] // Normal;
outcomeBackToClasses = Function[UnitStep[# - .2] - UnitStep[-# - .2]];


Here we classified the predicted numbers in range (-0.2 ; 0.2) as 0-s. Moreover numbers greater than 0.2 are now classified as 1s and those less than -0.2 to -1-s. Similar to the method we used in case of the Classifier, we try to find out the accuracy of prediction as well as how bad our predictor did in the cases of wrong predictions.

predictedOutcome = outcomeBackToClasses[p[testSet]] ;

dif = actual - predictedOutcome;

N @ Normalize[Counts[Abs[dif]], Total]

<|0 -> 0.631579, 2 -> 0.0964912, 1 -> 0.27193|>


Although the accuracy of our classifier has decreased slightly we see a fundamental improvement in totally missed guesses which are represented with number 2. The percentage of total misses decreased from 15% to 0.09%. Another advantage that the predictor has over the classifier is that now it also guesses reasonable amount of draws. Those advantages are very effective in betting, because draws are considered difficult events to predict and usually odds for draws are much higher than those for wins.

## Betting Simulation and Strategies

Here we start probably the most interesting part of the project—betting simulations. As it was mentioned above, our data has all the odds for the outcomes of all the matches of the league. There are 4 columns for the odds— win of Team1, win of Team2, Draw and the wining coefficient - CoefWin. Here is a sample of the odds data.

Now let's simulate simple bets. To do that we randomly choose 10 matches and give them to our predictor to guess the outcome. As a result we get numerical values which we then convert into three classes with the help of the piece-wise function. After that we get a list containing 1s, -1s and 0s. Those are the outcomes that the predictor guessed for randomly selected 10 matches.

randomMatches = RandomSample[testSetFull, 10];

p[randomMatches[All, {"Team1Rating", "Team2Rating", "H2H", "Outcome"}]]
{0.459584, -0.0321182, 0.967641, 0.459716, 1.04778, 0.201723, 0.399631, -0.0359151, 0.831428, -0.085207}

randomMatchesPredicted = outcomeBackToClasses[p[randomMatches[All, {"Team1Rating", "Team2Rating", "H2H", "Outcome"}]]]
{1, 0, 1, 1, 1, 1, 1, 0, 1, 0}


Than we write a rule and convert the guesses to the corresponding coefficients.

randomMatchesPredicted /. {0 -> "CoefX", 1 -> "Coef1", -1 -> "Coef2"}
{"Coef1", "CoefX", "Coef1", "Coef1", "Coef1", "Coef1", "Coef1", "CoefX", "Coef1", "CoefX"}


Let's remember how we determined the correct guesses. We simply checked if the difference between predicted and actual class is 0 or not. Therefore we construct a function that takes the difference between actual and predicted classes as an argument and returns 1 if and only if the difference is 0. Than we multiply the result of that function to the corresponding winning odds. This way we simulate a bet of 1 dollar for every match. That function is

1 - Abs[Sign[dif]]


Here is the result of the multiplication

(1 - Abs[Sign[Normal@randomMatches[All, "Outcome"] - randomMatchesPredicted]]) Normal @ randomMatches[All, "CoefWin"]
{1.93, 3.26, 1.15, 2.38, 1.11, 1.37, 1.36, 3.12, 1.88, 0.}


So the question is weather we actually win? To figure that out we do Total of the won money and then subtract from it the 10 dollars we have bet in total (1 dollar for a match).

Total[%] - 10
7.56


Booom. We won 7.56 dollars. However, do not rush to gamble. This process is purely probabilistic. That is, we could also lose money firstly because we have 66% of accuracy. And secondly, even if we had 100% accuracy based of the data of previous matches, we could not always win on future matches. To gain a better understanding of how profitable this kind of betting is, one needs to perform a sufficient number of betting simulations. Then only after analyzing the results and after calculating mean profit it would be possible to speak about long term wins. However, the significance in the numeric approach to the prediction of the outcome is that we can decide on what event to bet based on numeric predictions. Let me bring an example. Below are the predicted numeric values for randomly selected 10 matches.

{0.459584, -0.0321182, 0.967641, 0.459716, 1.04778, 0.201723, 0.399631, -0.0359151, 0.831428, -0.085207}


As we can see there are numbers which are near-perfect meaning those are very close to either 1, -1 or 0. For example take the 2nd and 8th numbers of the numeric predictions. Both of them are equal to -0.03, which means those are near perfect Draw guesses. So we can use this fact to make an xpressbet, which is a particular type of bet in which the player predicts multiple events and wins only if all the predicted events happen. In that case we get the odds multiplied to each other. In this particular case we try to predict 2 events. The odds for draws were 3.26 for the 2nd match and 3.12 for the 8th match so the total odd would be 10.17. And if we bet on this odd 1 dollar we earn 9 dollars, just by betting on 2 matches with an xpressbet. This strategy minimizes the risk of losing money at the same time maximizes the profit. However the probability to win falls because we have done an xpressbet. That is, we need both of the predictions to be correct. And this is where the predictor helps us to decide what are the near-perfect guesses to link and bet on. We see that the flexibility of our model allowed us to come up with some betting strategies based on the result of our numerical predictor. Definitely will be tested this weekend.

## Conclusion and Future Work

• 66% accuracy for classifier
• 63% accuracy for predictor
• optimal risk-profit betting strategy
• development of total goals predictor

## References

1. “Scrapy | A Fast and Powerful Scraping and Web Crawling Framework.” Accessed August 16, 2016. http://scrapy.org/.
2. Bernard, Etienne. “Posts by Etienne Bernard—Wolfram Blog.” Accessed August 18, 2016. http://blog.wolfram.com/author/etienne-bernard/.
Attachments:
2 years ago
6 Replies
 - you earned "Featured Contributor" badge, congratulations !This is a great post and it has been selected for the curated Staff Picks group. Your profile is now distinguished by a "Featured Contributor" badge and displayed on the "Featured Contributor" board.
2 years ago
 Nice post! I've read here that you used Scrapy to get information. Have you tried anyother scraping tools? I've only tried price scraping so far on http://www.tellprices.com/ and I was impressed.. Thiught maybe you could give me some advice on what other scraping tools you find effective. Thanks!
1 year ago
 Thank you very much! Yes, I've used Python framework - Scrapy for obtaining the data for this project. It can be used for both small projects and also for building complex spiders. There are a lot of tools for web-crawling, but the ones that I find effective besides Scrapy, are Beautiful Soup, which is easy python library for pulling data from HTML/XML pages, and Selenium WebDriver, which provides Python API to use its powerful functionality. It was originally written for testing web pages, but is widely used for data crawling, especially in those cases where there are complex captchas and other barriers of accessing data.
1 year ago
 Please see a followup discussion here: http://community.wolfram.com/groups/-/m/t/1218215 Predicting Winning Odds of Italian Serie A Soccer 1934-2017
9 months ago
 Nice post indeed, Mher!I have studied your code a while and I have some comments.Sorry if I misunderstood your code but here it goes... As I ran your code I suspected the H2H calculation was not according the description you've provided.H2H calculations should only take into account the previous games as one should expected and you said as well.Your code gives us an single H2H value for every two teams confrontation. That means H2H would be a wonderful predictor indeed since it contains information about all future confrontations of any two teams!For a single game H2H calculation didn't take the date as a reference.So the calculation takes all team1 against team2 and team2 against team1 games...66% accuracy for classifier drops to something around 57%, as I have made the following changes: headToHeadTotal[t1_, t2_, dt_] := N[Mean[ Join[ Normal@ outcomeAdded[ Select[#Team1 == t1 && #Team2 == t2 && #MatchDate < dt &], "Outcome"], -Normal@ outcomeAdded[ Select[#Team2 == t1 && #Team1 == t2 && #MatchDate < dt &], "Outcome"] ] ]] And: h2hAdded = outcomeAdded[All, Append[#, "H2H" -> (headToHeadTotal[#Team1, #Team2, #MatchDate])] &]; So the result is:Would you please confirm my interpretation?I have not built the predict of outcome yet so I do not have final numbers to show...
8 months ago
 Thank you for your comments Edson. You are absolutely right in what you say. However, the absence of date as reference in calculating head to head statistics has several reasons. Firstly, as you can see in the description of 'H2H' statistics, the low statistics are filtered. This shows that the variable 'H2H' variable initially was not intended to change over time, rather I wanted to have a general indicator of head to head clashes over some relative time period. Secondly, adding date as a reference has a big drawback. Actually it does not do what it is supposed to do. It does not show the 'past' performance in absolute manner (as our data does not contain all the historical data), but shows the 'past' performance of two particular teams only relative to the time period in the dataset that we give to the classifier. And I think the latter is the reason why adding the date as reference decreases the accuracy of the model. So, the following sentence in the text:''It [H2H] basically illustrates the advantage of Team1 over Team2 during their previous matches.'' means the previous matches before the match that is being predicted. (I agree it is not well formulated).