highperformancestats: Searching for new positional metrics (part 1: method and centre midfielders)

Introduction

In my first ever post, I showed the value of a new metric, chances created per incomplete pass. This metric was developed from an intuitive concept of how chance-creating passes are often the most difficult and was designed to level the playing field between defensive players with (in general) a high pass completion and a low chance creation and attacking players, for whom the opposite is true.

While the value of such a metric was shown, perhaps a preferable solution would be the development of specific metrics for each position. This development is attempted here.

Described below is a possible method for finding such metrics. There are a number of shortcomings in the execution of the method, due both to processing capabilities and the nature/volume of data available. These are acknowledged and explained at the end.

Method

Player data from the MCFC Analytics dataset is split into the position groups listed here. The groups are somewhat subjective, as with anything relating to formation; a table showing the classification used by formation code and position code is shown below:

Goalkeepers
Full backs
Centre backs
Wing backs
Defensive midfield
Central midfield
Wide midfield
Attacking midfield
Winger
Centre forward

The data is grouped by team and match so, for example, raw data for all of Liverpool's full backs for the game against Swansea is added together.
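The grouping step can be sketched in a few lines of Python. This is only an illustration of the idea, not the process actually used (which was run in Excel): the row layout, the field names and the numbers are all invented for the example.

```python
from collections import defaultdict

# Hypothetical per-player rows; the keys and values are made up for illustration.
rows = [
    {"team": "Liverpool", "match": "LIV-SWA", "group": "Full backs",
     "complete_passes": 41, "incomplete_passes": 9},
    {"team": "Liverpool", "match": "LIV-SWA", "group": "Full backs",
     "complete_passes": 35, "incomplete_passes": 12},
    {"team": "Liverpool", "match": "LIV-SWA", "group": "Centre backs",
     "complete_passes": 50, "incomplete_passes": 4},
]

KEY_FIELDS = ("team", "match", "group")


def aggregate_units(rows):
    """Sum every numeric field over players sharing team, match and position group."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = tuple(row[k] for k in KEY_FIELDS)
        for field, value in row.items():
            if field not in KEY_FIELDS:
                totals[key][field] += value
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(fields) for key, fields in totals.items()}


units = aggregate_units(rows)
print(units[("Liverpool", "LIV-SWA", "Full backs")])
```

Each key in `units` is one positional unit for one team in one match, which is the level of observation used in the rest of the post.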

Following this, a number of data fields are chosen and simple metrics sought of the form:

Metric = D1/D2

This form is used to allow rates to be found, if they are selected as appropriate metrics.

Each possible metric is generated in turn from the data fields selected and tested. The testing process involves finding the correlation coefficient between the metric and three possible dependent variables:

Goals scored by the team
Goals conceded by the team
Game margin (Goals scored - Goals conceded)

This process will seek out linear relationships between the metrics and these key outcomes. To look for other relationships, other correlation coefficients are found between the metric and the exponential of the dependent variables listed above and between the exponential of the metric and the dependent variables. So for each possible metric, 9 relationships are observed.

In some cases, the metric will be undefined, if the denominator is zero. To counter this, metrics are only included in the study if they are defined for at least 50% of the observations.
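The search described above might look something like the following sketch. It assumes the Pearson correlation coefficient (the post does not say which coefficient was used), treats D1/D2 and D2/D1 as separate metrics, and skips any exponential form that overflows; the function names and the toy observations are hypothetical.

```python
import math
from itertools import permutations


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def safe_exp_all(values):
    """Exponentiate every value; None if any result overflows a float."""
    try:
        return [math.exp(v) for v in values]
    except OverflowError:
        return None


def search_metrics(observations, fields, outcomes, min_defined=0.5):
    """Score every ratio D1/D2 against each outcome in three forms."""
    results = []
    for d1, d2 in permutations(fields, 2):
        # The metric is undefined wherever the denominator is zero.
        defined = [o for o in observations if o[d2] != 0]
        if len(defined) < min_defined * len(observations) or len(defined) < 2:
            continue
        metric = [o[d1] / o[d2] for o in defined]
        for outcome in outcomes:
            y = [o[outcome] for o in defined]
            forms = {
                "linear": (metric, y),
                "exp(outcome)": (metric, safe_exp_all(y)),
                "exp(metric)": (safe_exp_all(metric), y),
            }
            for form, (xs, ys) in forms.items():
                if xs is None or ys is None:
                    continue
                r = pearson(xs, ys)
                if r is not None:
                    results.append((abs(r), r, f"{d1}/{d2}", outcome, form))
    # Strongest relationship (by absolute correlation) first.
    return sorted(results, reverse=True)


# Toy observations, invented purely to exercise the function.
observations = [
    {"complete_passes": 100, "unsuccessful_dribbles": 2, "margin": 1},
    {"complete_passes": 60, "unsuccessful_dribbles": 3, "margin": -1},
    {"complete_passes": 120, "unsuccessful_dribbles": 1, "margin": 2},
    {"complete_passes": 80, "unsuccessful_dribbles": 0, "margin": 0},
    {"complete_passes": 40, "unsuccessful_dribbles": 4, "margin": -2},
]
ranked = search_metrics(
    observations, ["complete_passes", "unsuccessful_dribbles"], ["margin"]
)
for _, r, name, outcome, form in ranked:
    print(f"{name} vs {outcome} ({form}): r = {r:+.3f}")
```

With three outcomes supplied instead of one, each metric would be scored in the 9 relationships described above.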

Independent variables

The independent variables chosen are:

Shots on target
Shots off target
Complete passes
Incomplete passes
Successful dribbles
Unsuccessful dribbles
Duels won (aerial + ground)
Duels lost (aerial + ground)
Tackles won
Tackles lost
Blocks
Interceptions

Results: Centre midfielders

Metric chosen

For centre midfielders, the strongest correlation between any metric and any dependent variable was the linear relationship between margin of victory and the metric:

Successful passes/Unsuccessful dribbles

The correlation coefficient between the two is approximately 0.33. This does not indicate an exceptionally close relationship but the chart below shows that there does appear to be a (weak) relationship present:

The linear relationship between this metric and the margin of victory is described by the equation below:

Margin = 0.017*Successful passes/Unsuccessful dribbles - 1.11

Performances

For this metric, the best performing central midfield unit in 2011/12 was Manchester United v Stoke City on 31/1/12, Michael Carrick and Paul Scholes. This unit completed 242 passes and only had 1 unsuccessful dribble (note that, to be included, the unit must have at least one unsuccessful dribble). According to the formula above, Manchester United would expect to win this game by 2.94 goals - they in fact won 2-0.

The worst performing unit was Aston Villa v Bolton Wanderers on 24/4/12, Chris Herd, Marc Albrighton and Stephen Warnock. This unit completed 53 passes and had 7 unsuccessful dribbles. This would be expected to lead to a 0.99 goal defeat - the final score was 2-1 to Bolton Wanderers.

The biggest overachievement was Manchester United v Bolton Wanderers on 10/9/11. Anderson and Tom Cleverley completed only 67 passes and had 1 unsuccessful dribble, which leads to an expected margin of 0.009 goals, yet Manchester United won 5-0. This disparity is the result of a shortcoming of this method, which is discussed below.

The biggest underachievement was Blackburn Rovers v Arsenal on 4/2/12. Radosav Petrovic and Steven N'Zonzi completed 69 passes and had 1 unsuccessful dribble, which gives an expected margin of 0.04 goals, yet Arsenal won 7-1. These two midfielders actually outperformed the Manchester United midfield v Bolton Wanderers, mentioned in the previous paragraph.
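The expected margins quoted in this section can be approximately reproduced with a small helper. This is a sketch using the slope and intercept exactly as printed in the equation above; because those are presumably rounded, the outputs land close to, but not exactly on, the figures quoted in the text. The function name is hypothetical.

```python
SLOPE = 0.017      # coefficient as printed in the equation above (likely rounded)
INTERCEPT = -1.11  # intercept as printed in the equation above (likely rounded)


def expected_margin(completed_passes, unsuccessful_dribbles):
    """Expected goal margin for a central midfield unit; None if the metric is undefined."""
    if unsuccessful_dribbles == 0:
        return None
    return SLOPE * completed_passes / unsuccessful_dribbles + INTERCEPT


# Pass and dribble counts taken from the four matches discussed above.
examples = {
    "Manchester United v Stoke City": (242, 1),
    "Aston Villa v Bolton Wanderers": (53, 7),
    "Manchester United v Bolton Wanderers": (67, 1),
    "Blackburn Rovers v Arsenal": (69, 1),
}
for label, (passes, dribbles) in examples.items():
    print(f"{label}: {expected_margin(passes, dribbles):+.2f}")
```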

Conclusions

This method of finding metrics certainly has potential to develop more useful measures than those which have become standard - pass completion and possession rate, for example. The metric shown above provides a useful look at what can be expected of a centre midfielder: complete passes and avoid being caught in possession. While the job is of course more complicated than that, these factors can be included among the "basics" of sound central midfield play.

Shortcomings

As mentioned above, the execution of this method has some shortcomings, though I believe that the principle behind it is sound. These shortcomings are attributable to two main factors:

Processing capability:

The number of data fields considered was limited. For a metric including two data fields, the total number of metrics to be checked has a quadratic relationship with the number of data fields. 12 data fields were included here: even with an automatic Excel process for cycling through the combinations, processing took several hours. Ideally, many more data fields would be included, such as those related to position on the pitch (passes in each third, shots inside/outside the box).
Originally it was intended to seek metrics of more complex forms than the simple fraction given above. It was hoped that more than two data fields could be included, for example in the form (D1*D2)/(D3*D4). However, this modifies the quadratic relationship given above to a quartic relationship between number of data fields and number of metrics.
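To give a feel for the growth described in the two points above, here is a rough count of candidate metrics. The counting conventions are assumptions, not taken from the original process: D1/D2 and D2/D1 are counted separately, a field is never divided by itself, and in the four-field form all four fields are distinct with order inside the numerator and denominator ignored.

```python
from math import comb


def ratio_metric_count(n_fields):
    """Number of metrics D1/D2 with D1 != D2; grows as n^2."""
    return n_fields * (n_fields - 1)


def product_ratio_metric_count(n_fields):
    """Number of metrics (D1*D2)/(D3*D4) with four distinct fields; grows as n^4."""
    return comb(n_fields, 2) * comb(n_fields - 2, 2)


for n in (12, 24):
    print(n, ratio_metric_count(n), product_ratio_metric_count(n))
```

Under these conventions, doubling the number of data fields roughly quadruples the number of simple ratios but multiplies the number of four-field metrics by more than twenty, before the 9 relationships per metric are taken into account.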

Data available:

A key issue here is the treatment of substitutes. Within the MCFC Analytics dataset, substitutes are not assigned a position, simply being described as a substitute. This is notable in the Manchester United v Bolton Wanderers match mentioned above. Tom Cleverley was substituted off after 8 minutes, to be replaced by Michael Carrick. However, according to the dataset, the only central midfielders were Cleverley and Anderson, and hence the total number of passes completed by the central midfield was much lower than it otherwise would have been.
It has long been known in analysis of American Football that "garbage time" must be taken into account. Teams that are leading tend to rush more while those that are trailing pass more, leading to statistics being skewed in the late stages of games. Likewise in football, teams that are winning easily in the closing stages can often keep possession with limited pressure from the opposition, which increases passing statistics. This team, however, has not won because of these passes - they have passed because they have won. This garbage time can blur the line between cause and effect. Knowing the time and match score for each pass would vastly increase the complexity of the problem approached here but would make the project much more complete.
The final shortcoming is in the global nature of these metrics. There are 20 teams in the Premier League, each with a contrasting style of play. As well as that, the same team can employ different tactics for different opponents. A team that is built around long passes to a target man will naturally have fewer complete passes than a shorter passing team, but this is by design. The metric described here would penalise such teams, though if these tactics suit the personnel in the team, there is no reason why this team cannot have the same success as any other. This metric does not distinguish between different teams and gameplans and does not show how well a gameplan is executed - it merely considers a league average. Ideally, each team would have its own metric or, at least, its own linear relationship. However, this would give each team only 38 data points, which is too few from which to draw reliable conclusions.

Future

Metrics have been created for each position except goalkeeper. These will be discussed in future posts. It should also be possible to combine all the position metrics to find the best and worst overall team performances and also to see which positions have the most impact upon match results.


Saturday, 9 February 2013
