Football Stats

Built and maintained by Alex Trommer – atrommer@umich.edu

Introduction

As a huge fan of football (soccer), I chose to investigate upon a data set of football players and their various statistics, such as goals, age, 90s played (90 minutes played), shots on target, nation, position, as well as their teams, the league they play in, and for what season this is representing. Below is a sample of the dataset. This dataset has 88,310 rows and 10 columns, however only a fraction of these rows will actually be used.

Season	League	Team	Player	Nation	Position	Age	90s	Goals	Shots On Target
2000-2001	EPL	Manchester Utd	Gary Neville	eng ENG	DF	25	31.7	1	5
2000-2001	EPL	Manchester Utd	Fabien Barthez	fr FRA	GK	29	29.7	0	0
2000-2001	EPL	Manchester Utd	David Beckham	eng ENG	MF	25	29.4	9	34
2000-2001	EPL	Manchester Utd	Paul Scholes	eng ENG	MF	25	27.2	6	18
2000-2001	EPL	Manchester Utd	Roy Keane	ie IRL	MF	28	26.4	2	15

The question I have that I will be looking to answer is “Can I build a model to accurately predict whether or not a player won the league in a given year?”

Data Cleaning and Exploratory Data Analysis

To kick things off (haha), we have to clean the data: 1: I filled NA goals with 0 2: I renamed some columns that were mispelled 3: I dropped a couple of rows that had data we will not be using (such as squad total)

Out of curiosity, let’s see who has scored the most amount of goals in a single season! I will find the highest goal tally and get the index!

Player	Team	Season	Position	Age	Goals	Shots On Target
Lionel Messi	Barcelona	2011-2012	FW,MF	24	50	114

Unsurprising to me, given it is Messi, but interesting nonetheless.

What about finding the top 50 scorers per year? Well this is pretty simple to accomplish. By filtering the data and grouping the season, here are the top 50 scorers per year in a scatterplot!

However, as you can see along the y-axis, there is an issue with some of the data. Let’s take a step back. In soccer, a player has to get a shot on target in order to score. Realistically, it makes no sense. This must have gotten replaced during my data cleaning – likely an NA replaced as a 0.

League	0
Bundesliga	18
EPL	26
Eredivisie	32
LaLiga	11
Ligue1	61
PrimeiraLiga	452
SerieA	36

That is the distribution of cases in which the goals > 1 but shots on target are 0. It is a widespread issue!

We do imputation based off of the league – a linear regression model!

After imputation, we filter the data as such: 1: Players who contain “FW” in their position 2: Played at least 5 90s (Ensures substitute players don’t influence the data too much with low goals shots on target if they win the league) 3: Data from only the last five years

Also, in the 2019-2020 season, the eredivisie was suspended because of the covid pandemic, so to be safe let’s drop the entire year.

Finally it’s time to add the new column, which will be used to train the data for a K-Nearest-Neighbor classifier prediction model. Let’s make a dictionary for the winners of the leagues for the last five years.

	2022-2023	2021-2022	2020-2021	2019-2020	2018-2019
EPL	Manchester City	Manchester City	Manchester City	Liverpool	Manchester City
LaLiga	Real Madrid	Real Madrid	Atletico Madrid	Real Madrid	Barcelona
SerieA	Napoli	AC Milan	Inter Milan	Juventus	Juventus
Bundesliga	Bayern Munich	Bayern Munich	Bayern Munich	Bayern Munich	Bayern Munich
Ligue1	Paris S-G	Paris S-G	Lille	Paris S-G	Paris S-G
Eredivisie	Ajax	Ajax	Ajax	Ajax	Ajax
PrimeiraLiga	Benfica	Porto	Sporting CP	Benfica	Benfica

This is the dictionary in dataframe form that I used to add in the winners!

Here is an updated scatterplot with everything so far.

Framing A Prediction

How do we know there is actually a difference between shots on target and goal metrics for players who won the league and those who did not? Well, here is a dataframe showing the mean for cases where players won their respective leagues or did not:

Winners	Goals	Shots On Target
False	6.60166	16.9581
True	13.5941	29.3663

There is a massive gap between the two! This means this is a logical thing to investigate and try to predict!

Baseline Model

To begin, I took the goals and shots on target columns as predictors, and set the winners column as the target. I trained a KNeighborsClassifier with the number of neighbors at 5. I trained the data and saved the instance for later. 25% of the data would then be testing data. This worked reasonably well, coming back with an accuracy of 0.9435.

Here is a plot showcasing the decision boundary of the model!

[TO BE UPDATED]

It looks alright, but we can do better. Also we must be wary of overfitting.

Final Model

Building off my baseline model, I decided to also One Hot Encode the Seasons column to try to improve the accuracy. Styles of play in football can change quickly and significantly, and some years may have star players that score crazy amounts of goals that are hard to replicate. This is a logical metric to OHE. Also implemented a scaler to see if that would improve the accuracy of the model.

The final model consisted of a KNN model pipeline with hyperparameter tuning. It includes feature preprocessing using ColumnTransformer to scale numerical features shots on target and goals and encode the categorical feature season. A Pipeline integrates preprocessing and KNN classification. Hyperparameters are tuned using GridSearchCV with 5-fold cross-validation. The best parameters ‘knn_metric’: ‘euclidean’, ‘knn_n_neighbors’: 11, ‘knn_weights’: ‘uniform’ and corresponding accuracy of 0.9545 were retrieved. I wanted to test this to see if there were better ways to weigh distance, calculate distance differently, or get a better number of neighbors. This told me the ideal number of neighbors is 11, and all other hyperparameters were fine as they were. This is slightly higher than the previous model, thus showing an improved accuracy with the same training dataset.