Statistical football prediction is a method used in sports betting to predict the outcome of football (soccer) matches by means of statistical tools. The goal of statistical match prediction is to outperform the predictions of bookmakers, who use their own predictions to set odds on the outcome of football matches.
Contents
- History
- Football Prediction Methods
- Time-Independent Least Squares Rating
- Time-Independent Poisson Regression
- Time-Dependent Markov Chain Monte Carlo
- References
The most widely used statistical approach to prediction is ranking. Football ranking systems assign a rank to each team based on its past game results, so that the highest rank is assigned to the strongest team. The outcome of a match can then be predicted by comparing the opponents’ ranks. Several football ranking systems exist; widely known examples are the FIFA World Rankings and the World Football Elo Ratings.
There are three main drawbacks to football match predictions that are based on ranking systems:
- Ranks assigned to the teams do not differentiate between their attacking and defensive strengths.
- Ranks are accumulated averages which do not account for skill changes in football teams.
- The main goal of a ranking system is not to predict the results of football games, but to sort the teams according to their average strength.
Another approach to football prediction is known as rating systems. While ranking refers only to team order, rating systems assign to each team a continuously scaled strength indicator. Moreover, a rating can be assigned not only to a team but also to its attacking and defensive strengths, to home field advantage, or even to the skills of each individual player (according to Stern). An example of a football rating system is the pi-rating system, which provides relative measures of superiority between football teams (and is also applicable to other sports) and which is reported to considerably outperform the widely accepted Elo rating system in terms of profitability against the betting market.
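To make the step from a rating to a prediction concrete, the sketch below uses the standard Elo expected-score formula. It is an illustration only: it is not the pi-rating system, the function name and the example ratings are hypothetical, and the expected score counts a draw as half a win, so it is not a full win/draw/loss forecast.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of team A against team B under the Elo model.

    The result lies between 0 and 1; a draw counts as 0.5, so this is an
    expected score rather than a pure win probability.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings, for illustration only.
strong_team, weak_team = 1900, 1500
print(elo_expected_score(strong_team, weak_team))  # favourite's expected score
print(elo_expected_score(weak_team, strong_team))  # underdog's expected score
```

The two expected scores always sum to 1, and two equally rated teams each get 0.5. A single number per team cannot separate attack from defence, which is the limitation that the rating models described later address.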
History
Publications about statistical models for football prediction began to appear in the 1990s, but the first model was proposed much earlier by Moroney, who published his first statistical analysis of soccer match results in 1956. According to his analysis, both the Poisson distribution and the negative binomial distribution provided an adequate fit to the results of football games. Sequences of passes between players during football matches were successfully analyzed using the negative binomial distribution by Reep and Benjamin in 1968. They improved this method in 1971, and in 1974 Hill indicated that soccer game results are to some degree predictable and not simply a matter of chance.
The first model predicting outcomes of football matches between teams with different skills was proposed by Michael Maher in 1982. According to his model, the goals that each opponent scores during a game are drawn from a Poisson distribution. The model parameters are defined by the difference between attacking and defensive skills, adjusted by the home field advantage factor. The methods for modeling the home field advantage factor were summarized in an article by Courneya and Carron in 1992. The time-dependency of team strengths was analyzed by Knorr-Held in 1999. He used recursive Bayesian estimation to rate football teams; this method was more realistic than soccer prediction based on common average statistics.
Football Prediction Methods
All the prediction methods can be categorized according to tournament type, time-dependence and regression algorithm. The methods differ between round-robin tournaments and knockout competitions. The methods for knockout competitions are summarized in an article by Diego Kuonen.
The table below summarizes the methods related to round-robin tournaments.
Time-Independent Least Squares Rating
This method assigns to each team in the tournament a continuously scaled rating value, so that the strongest team has the highest rating. The method is based on the assumption that the difference between the ratings of two rival teams is proportional to the outcome of their match.
Assume that the teams A, B, C and D are playing in a tournament and the match outcomes are as follows:
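Though the ratings $x_A$, $x_B$, $x_C$ and $x_D$ of teams A, B, C and D are unknown, it may be assumed that the outcome $y_n$ of match $n$ (for example, the home team's goal difference) is proportional to the difference between the ratings of the home and away teams, up to a noise term $\varepsilon_n$. For a match in which A hosts B:

$$y_n = x_A - x_B + \varepsilon_n.$$

By introducing a selection matrix X, the equations above can be rewritten in a compact form:

$$\mathbf{y} = X\mathbf{x} + \boldsymbol{\varepsilon}.$$

Entries of the selection matrix can be either 1, 0 or -1, with 1 corresponding to home teams and -1 to away teams: row $n$ of $X$ has a 1 in the column of the home team of match $n$, a -1 in the column of the away team, and 0 elsewhere.

If the matrix $X^{T}X$ has full rank, the solution of the system can be found by the least squares method:

$$\hat{\mathbf{x}} = (X^{T}X)^{-1}X^{T}\mathbf{y}.$$

If not, one can use the Moore–Penrose pseudoinverse $X^{+}$ to get:

$$\hat{\mathbf{x}} = X^{+}\mathbf{y}.$$

The final rating parameters are the entries of $\hat{\mathbf{x}}$, with the strongest team receiving the highest rating.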
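The least squares rating can be sketched in a few lines of standard-library Python. Everything specific here is an assumption made for illustration: the match results are hypothetical, the match outcome is taken to be the home team's goal difference, and the function name is invented. Instead of forming the pseudoinverse explicitly, the sketch runs gradient descent from all-zero ratings, which converges to the same minimum-norm least squares solution.

```python
# Hypothetical results: (home team, away team, home goals, away goals).
TEAMS = ["A", "B", "C", "D"]
MATCHES = [
    ("A", "B", 3, 1),
    ("C", "D", 2, 1),
    ("D", "B", 1, 4),
    ("A", "D", 3, 1),
    ("B", "C", 2, 0),
]


def least_squares_ratings(teams, matches, steps=2000):
    """Minimise sum((x_home - x_away - goal_diff)^2) by gradient descent.

    Ratings are only determined up to a common shift. Starting from zero
    keeps the iterates orthogonal to that shift, so the result is the
    minimum-norm solution, whose ratings sum to zero.
    """
    x = {t: 0.0 for t in teams}
    games = {t: 0 for t in teams}
    for home, away, _, _ in matches:
        games[home] += 1
        games[away] += 1
    # The largest eigenvalue of X^T X is at most twice the largest number
    # of games played by one team, so this step size is stable.
    lr = 1.0 / (2 * max(games.values()))
    for _ in range(steps):
        grad = {t: 0.0 for t in teams}
        for home, away, home_goals, away_goals in matches:
            resid = (x[home] - x[away]) - (home_goals - away_goals)
            grad[home] += resid
            grad[away] -= resid
        for t in teams:
            x[t] -= lr * grad[t]
    return x


ratings = least_squares_ratings(TEAMS, MATCHES)
for team in sorted(ratings, key=ratings.get, reverse=True):
    print(team, round(ratings[team], 3))
```

For these made-up results the ratings come out as A = 1.625, B = 0.75, C = -0.875 and D = -1.5. The predicted goal difference for a future match is then simply the rating of the home team minus the rating of the away team.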
Time-Independent Poisson Regression
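According to this model (Maher), if $G_h$ and $G_a$ are the numbers of goals scored by the home team $i$ and the away team $j$ in a match, then they are independent Poisson random variables with means $\lambda_h$ and $\lambda_a$:

$$G_h \sim \mathrm{Poisson}(\lambda_h), \qquad G_a \sim \mathrm{Poisson}(\lambda_a).$$

The joint probability of the home team scoring $x$ goals and the away team scoring $y$ goals is therefore the product of the two individual probabilities,

$$P(G_h = x, G_a = y \mid \lambda_h, \lambda_a) = \frac{\lambda_h^{x} e^{-\lambda_h}}{x!} \cdot \frac{\lambda_a^{y} e^{-\lambda_a}}{y!},$$

while the generalized log-linear model for $\lambda_h$ and $\lambda_a$ is defined as

$$\log \lambda_h = c + h + a_i - d_j, \qquad \log \lambda_a = c + a_j - d_i,$$

where $a_i$ and $d_i$ are the attacking and defensive strengths of team $i$, $h$ is the home field advantage factor and $c$ is a constant.

Assuming that C signifies the number of teams participating in a season and N stands for the number of matches played until now, the team strengths can be estimated by minimizing the negative log-likelihood function with respect to $a_i$ and $d_i$ ($i = 1, \dots, C$), $h$ and $c$:

$$L = -\sum_{n=1}^{N} \log P(G_h = x_n, G_a = y_n \mid \lambda_{h,n}, \lambda_{a,n}).$$

Given that the scores $x_n$ and $y_n$ of the matches played are known, the parameters that minimize $L$ can be found numerically.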
Improvements to this model were suggested by Mark Dixon and Stuart Coles. They introduced a correlation factor for the low scores 0-0, 1-0, 0-1 and 1-1, where the independent Poisson model does not hold. Dimitris Karlis and Ioannis Ntzoufras built a time-independent Skellam distribution model. Unlike the Poisson model, which fits the distribution of scores, the Skellam model fits the difference between home and away scores.
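The basic independent Poisson model of this section (without the Dixon and Coles correction) can be sketched as follows. Several things are assumptions of the sketch rather than part of the source: the parametrisation log(rate_home) = c + h + a[home] - d[away] and log(rate_away) = c + a[away] - d[home], the hypothetical double round-robin results, the function names, and the use of plain gradient descent to minimise the negative log-likelihood.

```python
import math

# Hypothetical results: (home team, away team, home goals, away goals).
TEAMS = ["A", "B", "C"]
MATCHES = [
    ("A", "B", 2, 1), ("A", "C", 3, 1),
    ("B", "A", 1, 1), ("B", "C", 2, 1),
    ("C", "A", 1, 2), ("C", "B", 1, 1),
]


def fit_poisson_model(teams, matches, lr=0.01, steps=20000):
    """Fit constant c, home advantage h, attack a[t] and defence d[t]
    by gradient descent on the negative Poisson log-likelihood."""
    c, h = 0.0, 0.0
    a = {t: 0.0 for t in teams}
    d = {t: 0.0 for t in teams}
    for _ in range(steps):
        grad_c = grad_h = 0.0
        grad_a = {t: 0.0 for t in teams}
        grad_d = {t: 0.0 for t in teams}
        for home, away, x, y in matches:
            rate_home = math.exp(c + h + a[home] - d[away])
            rate_away = math.exp(c + a[away] - d[home])
            # Derivative of the negative log-likelihood with respect to
            # log(rate) is (rate - observed goals).
            err_home, err_away = rate_home - x, rate_away - y
            grad_c += err_home + err_away
            grad_h += err_home
            grad_a[home] += err_home
            grad_d[away] -= err_home
            grad_a[away] += err_away
            grad_d[home] -= err_away
        c -= lr * grad_c
        h -= lr * grad_h
        for t in teams:
            a[t] -= lr * grad_a[t]
            d[t] -= lr * grad_d[t]
    return c, h, a, d


def expected_goals(params, home, away):
    """Expected (home, away) goals for a fixture under the fitted model."""
    c, h, a, d = params
    return (math.exp(c + h + a[home] - d[away]),
            math.exp(c + a[away] - d[home]))


params = fit_poisson_model(TEAMS, MATCHES)
print(expected_goals(params, "A", "C"))
```

The individual parameters are only identified up to constant shifts (for example, adding a constant to every attack strength and subtracting it from c), but the fitted scoring rates are not affected by this. The fixed step size is tuned to this small example; a larger data set would call for a proper optimiser.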
Time-Dependent Markov Chain Monte Carlo
On the one hand, statistical models require a large number of observations to estimate their parameters accurately, and when not enough observations are available during a season (as is usually the case), working with average statistics makes sense. On the other hand, it is well known that team skills change during the season, making model parameters time-dependent. Mark Dixon and Coles tried to resolve this trade-off by assigning a larger weight to the latest match results. Rue and Salvesen introduced a novel time-dependent rating method using the Markov Chain model.
They suggested modifying the generalized linear model above for
given that
According to the model, the attacking strength
where
This model is based on the assumption that:
Assuming that three teams A, B and C are playing in the tournament and the matches are played in the following order:
Since analytical estimation of the parameters is difficult in this case, the Monte Carlo method is applied to estimate the parameters of the model.
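As a minimal illustration of Markov chain Monte Carlo estimation, the sketch below samples the posterior of a single team's scoring rate with a random-walk Metropolis sampler. It is a toy example, not the Rue and Salvesen model: the goal counts, the normal prior on the log-rate, the step size and all function names are assumptions chosen to keep the example short.

```python
import math
import random


def log_posterior(theta, goals, prior_sd=1.0):
    """Unnormalised log posterior of theta = log(scoring rate).

    Poisson likelihood for the observed goals (the log(g!) terms are
    constant and dropped) plus a zero-mean normal prior on theta.
    """
    rate = math.exp(theta)
    log_lik = sum(g * theta - rate for g in goals)
    log_prior = -0.5 * (theta / prior_sd) ** 2
    return log_lik + log_prior


def metropolis(goals, n_samples=20000, step=0.5, burn_in=2000, seed=0):
    """Random-walk Metropolis sampler; returns samples of the scoring rate."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, goals)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, goals)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            samples.append(math.exp(theta))
    return samples


# Hypothetical goals scored by one team in five matches.
samples = metropolis([2, 1, 3, 0, 2])
print(sum(samples) / len(samples))  # posterior mean of the scoring rate
```

The same accept/reject loop applies when the state is a full vector of time-varying attack and defence strengths; only the log-posterior function changes, which is why sampling is attractive when the posterior has no closed form.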