Monday 11 June 2018

A World Cup prediction model

In World Cup years, prediction models abound. Described here is a simple logistic model and its results.

The model

Independent variables

  • Difference in ELO scores (taken from eloratings.net/)
  • Dummy variable to identify hosts
  • Dummy variable to identify teams from the same confederation as the hosts (usually nations from the same continent) 

Model generation

Two logistic regressions were run, using the same variables but different training sets. One used all group games from the last three World Cup finals (referred to subsequently as the Group Model) and the other used all knockout games from the same tournaments (the Knockout Model). For knockout games where extra time was played, the result after extra time was taken. Otherwise, the result after 90 minutes was used.
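As a rough illustration of how a model of this kind turns the three variables into outcome probabilities, here is a minimal sketch. It assumes a multinomial (softmax) logistic form over win, draw and loss, which the description above does not confirm. The feature encoding (from team A's point of view) and every weight in EXAMPLE_WEIGHTS are made-up placeholders, not the fitted values.

```python
import math

def outcome_probabilities(features, weights):
    """Softmax over linear scores; returns {outcome: probability}.

    features: feature name -> value, from team A's perspective (assumed encoding).
    weights:  outcome -> {feature name or "intercept" -> coefficient}.
    """
    scores = {}
    for outcome, w in weights.items():
        scores[outcome] = w["intercept"] + sum(
            w[name] * value for name, value in features.items()
        )
    # Subtract the largest score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {outcome: math.exp(s - top) for outcome, s in scores.items()}
    total = sum(exps.values())
    return {outcome: e / total for outcome, e in exps.items()}

# Placeholder weights, NOT the coefficients fitted in this post.
# "draw" acts as the reference outcome, so its weights are all zero.
EXAMPLE_WEIGHTS = {
    "win":  {"intercept": 0.0, "elo_diff": 0.004, "host": 0.5, "same_confed": 0.2},
    "draw": {"intercept": 0.0, "elo_diff": 0.0, "host": 0.0, "same_confed": 0.0},
    "loss": {"intercept": 0.0, "elo_diff": -0.004, "host": -0.5, "same_confed": -0.2},
}

# Team A is the host and is rated 150 points above team B.
probs = outcome_probabilities(
    {"elo_diff": 150, "host": 1, "same_confed": 0}, EXAMPLE_WEIGHTS
)
print(probs)
```

Fitting the Group Model and the Knockout Model would then amount to estimating two separate sets of weights of this shape, one per training set.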

Comparison of models

The variable weights differ between the Group Model and the Knockout Model.

Group Model

Knockout Model

Home nation advantage is strong in the group stage. In the knockout stage, a home nation still has an advantage, but a non-host team from the same confederation as the hosts has a larger one.

Simulations

A Monte Carlo simulation is run with 2,500 trials as described below.
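To give a feel for the precision that 2,500 trials buys, the sketch below computes the usual binomial standard error of an estimated probability. It assumes the trials are independent, and the function name is only illustrative.

```python
import math

def monte_carlo_standard_error(p, trials):
    """Standard error of a probability estimated from independent trials."""
    return math.sqrt(p * (1.0 - p) / trials)

TRIALS = 2500
for p in (0.01, 0.10, 0.50):
    se = monte_carlo_standard_error(p, TRIALS)
    print(f"p = {p:.2f}: standard error ~ {se:.4f}")
```

An estimate near 50% is good to about one percentage point either way, and one near 10% to about 0.6 points. The smallest non-zero probability the simulation can report is 1 in 2,500, or 0.04%.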

Group stages

The Group Model is used to calculate probabilities for each outcome (either team winning or a draw) for each group game in the 2018 World Cup and simulate a result. This is done chronologically. After each game is simulated, the ELO score for each team is updated. Once all group games have been simulated, group tables are calculated. Since the model only generates a result, not goals scored or winning margin, ties are broken randomly.
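The group-stage loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for this post: toy_predict is a hypothetical stand-in for the Group Model, the ratings are invented, K = 60 is an assumed match weight, and the Elo update is the plain formula with no goal-difference adjustment (the model produces no scoreline to base one on).

```python
import random

K = 60  # assumed match weight; treat it as a tunable parameter

def expected_score(elo_a, elo_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

def update_elo(elo, team_a, team_b, result_a):
    """Update both ratings in place; result_a is 1.0 (win), 0.5 (draw) or 0.0 (loss)."""
    delta = K * (result_a - expected_score(elo[team_a], elo[team_b]))
    elo[team_a] += delta
    elo[team_b] -= delta

# Table points for (team A, team B) given team A's result.
POINTS = {1.0: (3, 0), 0.5: (1, 1), 0.0: (0, 3)}

def simulate_group(fixtures, elo, predict, rng):
    """Play the fixtures in order and return the teams ranked first to last."""
    points = {}
    for team_a, team_b in fixtures:
        p_win, p_draw, p_loss = predict(team_a, team_b, elo)
        result_a = rng.choices([1.0, 0.5, 0.0], weights=[p_win, p_draw, p_loss])[0]
        pts_a, pts_b = POINTS[result_a]
        points[team_a] = points.get(team_a, 0) + pts_a
        points[team_b] = points.get(team_b, 0) + pts_b
        update_elo(elo, team_a, team_b, result_a)
    # Rank by points; the random second key breaks ties on points randomly.
    return sorted(points, key=lambda t: (points[t], rng.random()), reverse=True)

def toy_predict(team_a, team_b, elo):
    """Hypothetical stand-in for the Group Model: a fixed 25% draw rate."""
    p_a = expected_score(elo[team_a], elo[team_b])
    return 0.75 * p_a, 0.25, 0.75 * (1.0 - p_a)

# Invented ratings and a chronological fixture list for one four-team group.
ratings = {"A": 1900.0, "B": 1800.0, "C": 1700.0, "D": 1600.0}
fixtures = [("A", "B"), ("C", "D"), ("A", "C"), ("B", "D"), ("A", "D"), ("B", "C")]
ranking = simulate_group(fixtures, ratings, toy_predict, random.Random(0))
print(ranking, ratings)
```

Because the ratings dictionary is updated in place, each later fixture is predicted from the ratings as they stand after the earlier simulated results.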

Knockout stages

The group tables allow the knockout fixtures to be generated. Results for these are generated according to probabilities derived using the Knockout Model. Any draws are settled randomly (note that a draw here refers to a game that is level at the end of extra time, i.e. a game decided by penalties). Again, ELO scores for each team are updated following each game.
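A single knockout tie and one round of the bracket might be sketched like this. The predict argument is a hypothetical stand-in for the Knockout Model, and returning 0.5 for a shootout (so that it counts as a draw in the rating update) is an assumption on my part rather than something stated above.

```python
import random

def simulate_knockout_match(team_a, team_b, predict, rng):
    """Return (winner, result_a), where result_a is 1.0, 0.5 or 0.0 for team A.

    predict(team_a, team_b) -> (p_win_a, p_draw, p_win_b) after extra time.
    """
    p_win, p_draw, p_loss = predict(team_a, team_b)
    outcome = rng.choices(["a", "draw", "b"], weights=[p_win, p_draw, p_loss])[0]
    if outcome == "draw":
        # Level after extra time: the shootout is settled by a coin flip.
        return rng.choice([team_a, team_b]), 0.5
    if outcome == "a":
        return team_a, 1.0
    return team_b, 0.0

def play_round(teams, predict, rng):
    """Pair adjacent teams and return the winners in bracket order."""
    return [
        simulate_knockout_match(teams[i], teams[i + 1], predict, rng)[0]
        for i in range(0, len(teams), 2)
    ]

def always_level(team_a, team_b):
    """Toy predictor (hypothetical): every tie goes to penalties."""
    return 0.0, 1.0, 0.0

rng = random.Random(1)
semi_finalists = ["A", "B", "C", "D"]
finalists = play_round(semi_finalists, always_level, rng)
champion = play_round(finalists, always_level, rng)[0]
print(finalists, champion)
```

Calling play_round repeatedly halves the field each time, so a 16-team bracket is resolved in four calls.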

Results

Following 2,500 simulations, probabilities can be generated for each stage of progress for each team. These are shown below, with the table sorted by winning probability, then probability of reaching the final, etc. Blanks indicate an outcome that did not occur in any of the simulations.
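Turning the simulated tournaments into a table of this kind is a counting exercise. The sketch below assumes each simulation is recorded as a mapping from team to the number of knockout stages it reached (0 for a group-stage exit, 5 for the winner). That record format, the stage labels and the toy runs are all hypothetical.

```python
STAGES = ["Round of 16", "Quarter-final", "Semi-final", "Final", "Winner"]

def stage_probabilities(runs):
    """Return [(team, [probability per stage])], sorted by winner, then final, etc."""
    counts = {}
    for run in runs:
        for team, furthest in run.items():
            row = counts.setdefault(team, [0] * len(STAGES))
            # A team that reached stage k also reached every earlier stage.
            for stage in range(furthest):
                row[stage] += 1
    n = len(runs)
    table = {team: [c / n for c in row] for team, row in counts.items()}
    # Reversing each row puts the winner column first in the sort key.
    return sorted(table.items(), key=lambda item: item[1][::-1], reverse=True)

def format_cell(p):
    """Blank for outcomes that never occurred, otherwise a percentage."""
    return "" if p == 0 else f"{p:.1%}"

# Four toy runs with three teams, purely for illustration.
runs = [
    {"A": 5, "B": 4, "C": 0},
    {"A": 4, "B": 5, "C": 1},
    {"A": 5, "B": 2, "C": 1},
    {"A": 3, "B": 1, "C": 0},
]
result = stage_probabilities(runs)
for team, row in result:
    print(team, [format_cell(p) for p in row])
```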
No comment is offered on these results.

Known shortcomings of the method

No single number, such as an ELO rating, can ever truly capture the nature of a team. A rating is blind to player form or fitness, tactics, and any particularly favourable or unfavourable positional matchups between two teams.

Breaking ties randomly, particularly in the group stage, is probably not optimal. It would be reasonable to expect that, in a tie, the better team would hold the superior goal difference even though the points tallies are the same. For knockout matches this is less of a factor, as only penalty shootouts are determined randomly. Although there is certainly an element of skill in a penalty shootout, the outcome is heavily subject to randomness.

Footnote:

I aim to update these probabilities throughout the tournament as time allows. I also aim to review the predictions after the tournament, particularly in comparison to those given by Opta, reported here, and those from Zeileis et al. at the Universität Innsbruck. These give a similar level of detail (each team, each stage of progress, probabilistic, generated around the same time). Any other models I find that meet these criteria will also be included. This is more out of interest than because I expect the simple model presented here to be superior.
