Alex Trommer

Data Science · University of Michigan · [email protected] · GitHub

Projects

League Winner Predictor

Each season, only one team wins each league, and its forwards tend to score about twice as many goals and put far more shots on target than everyone else. This project trains a K-Nearest Neighbors (KNN) classifier to predict whether a given forward's team won the league that season from two performance features, goals scored and shots on target, plus a one-hot encoded season indicator. Trained on 88,310 player-season records across seven European leagues and five seasons, the model achieves 95.45% accuracy against a 94.35% baseline.

95.45% Final accuracy
94.35% Baseline accuracy
88,310 Player-season records
7 Leagues · 5 seasons

Dataset

Player-season records from the Premier League, La Liga, Serie A, Bundesliga, Ligue 1, Eredivisie, and Primeira Liga. The model is restricted to forwards with at least five 90-minute appearances, where the goal-scoring signal is strongest. The 2019–20 season is excluded because the Eredivisie was abandoned mid-season and no champion was declared.

Why this works

League-winning teams dominate possession and create more chances — so their forwards rack up noticeably more goals and shots than forwards on lower-finishing sides. This gap is large enough that a simple classifier can pick up on it reliably.

Group         Avg Goals   Avg Shots on Target
Non-winners   6.60        16.96
Winners       13.59       29.37

Data cleaning

Model

KNN classifies each player by looking at the 11 most similar player-seasons in the training data (by goals and shots) and taking a majority vote on whether those neighbors' teams won the league. Hyperparameters were tuned with GridSearchCV over 150 combinations, each scored by 5-fold cross-validation.

Features: Shots On Target, Goals, Season (one-hot encoded)
Scaler: StandardScaler on numerical features
Search: GridSearchCV, 5-fold CV, accuracy scoring
Best params: n_neighbors = 11, metric = euclidean, weights = uniform
CV score: 0.9545
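The voting step can be sketched in plain Python. This is a minimal illustration rather than the project's code, which uses scikit-learn: the toy training rows and the helper names `standardize` and `knn_predict` are hypothetical, the season feature is left out, and k is set to 3 only because the toy set is tiny.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column; returns the scaled rows and the (mean, std) per column."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    scaled = [[(v - m) / s for v, (m, s) in zip(r, stats)] for r in rows]
    return scaled, stats

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean, uniform weights)."""
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [goals, shots on target] -> 1 if the team won the league.
X = [[5, 14], [7, 18], [6, 16], [8, 20], [14, 30], [13, 28], [15, 33]]
y = [0, 0, 0, 0, 1, 1, 1]

X_scaled, stats = standardize(X)
# Scale the query forward with the training statistics before measuring distance.
query = [(v - m) / s for v, (m, s) in zip([12, 27], stats)]
prediction = knn_predict(X_scaled, y, query, k=3)
print(prediction)
```

Scaling matters here because shots on target span a wider numeric range than goals, so unscaled Euclidean distance would be dominated by the shots column.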

Key findings

Ideal Corner Kick Delivery Zones

Where should you aim a corner kick? Using StatsBomb open event data, every delivery is mapped onto a 120×80 pitch and statistical testing identifies which zones produce a meaningful improvement over the 2.9% baseline success rate.

GA3 Only significant zone
6.2% GA3 success rate
2.9% Baseline rate
p < 0.0001 GA3 z-test

Data & setup

Corner kick events from StatsBomb open data. A success is defined as a goal within 10 seconds of the corner. All corners taken from the bottom side of the pitch have their Y coordinates mirrored so every delivery is analysed relative to the same goal. Short corners — where the ball is played to a nearby teammate rather than delivered into the box — are isolated as a separate category.
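A minimal sketch of the mirroring and the success rule, under stated assumptions: "bottom side" is taken to mean a corner whose start y is below the halfway line at 40 (which is consistent with the zone table, where the near post sits at y 44–50), and the function names and the flat list of goal timestamps are hypothetical rather than the project's actual data structures.

```python
PITCH_WIDTH = 80.0        # StatsBomb coordinates: 120 long, 80 wide
SUCCESS_WINDOW_S = 10.0   # a goal must come within 10 seconds of the corner

def normalise_delivery(start_y, end_x, end_y):
    """Mirror the delivery endpoint when the corner starts at y < 40 (assumed 'bottom')."""
    if start_y < PITCH_WIDTH / 2:
        end_y = PITCH_WIDTH - end_y
    return end_x, end_y

def is_success(corner_time_s, goal_times_s):
    """True if any goal timestamp falls inside the success window after the corner."""
    return any(0 <= t - corner_time_s <= SUCCESS_WINDOW_S for t in goal_times_s)

mirrored = normalise_delivery(start_y=0.1, end_x=116.0, end_y=47.0)
unchanged = normalise_delivery(start_y=79.9, end_x=116.0, end_y=33.0)
print(mirrored, unchanged)
print(is_success(corner_time_s=1500.0, goal_times_s=[1507.5]))
```

After mirroring, both example deliveries land on the same back-post coordinates, so they can be counted in the same zone.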

The boxes

In soccer, the 18-yard box (penalty area) is the large rectangle in front of goal, roughly the area where most corners are aimed. The 6-yard box (goal area) is the smaller rectangle directly in front of the goal mouth, spanning the width of the goal plus six yards on each side. Deliveries into the 6-yard box are harder to defend because they arrive very close to goal, but they are also harder to aim accurately. The near post is the goalpost closer to where the corner is being taken; the back post (far post) is the one farther away. Back-post deliveries are harder for goalkeepers to reach and give attackers a run-up toward goal.

Zone classification

Zones follow the framework from Casal et al. (2019), dividing the attacking end of the pitch into nine named regions based on X (depth) and Y (width) coordinates on a 120×80 pitch. Zone names use a letter prefix for area (G = goal area / 6-yard box, C = central 18-yard box, F = front, B = back) and a number for position across the width (1 = near post, 2 = centre, 3 = back post).

Zone   Description                              X         Y
GA3    6-yard box, back/far post                114–120   30–36
GA2    6-yard box, centre                       114–120   36–44
GA1    6-yard box, near post                    114–120   44–50
CA3    18-yard box, back/far post               108–114   30–36
CA2    18-yard box, centre                      108–114   36–44
CA1    18-yard box, near post                   108–114   44–50
FZ     Front of 18-yard box, near-post side     102–120   50–62
BZ     Front of 18-yard box, back-post side     102–120   18–30
Edge   Edge of 18-yard box, central             100–108   30–50
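The table translates directly into a lookup. The sketch below is one possible reading of it, not the project's code: the function name `classify_zone`, the "Other" fallback, and the boundary handling (first match wins, upper y bound exclusive) are assumptions, since the table does not say which zone owns a shared edge.

```python
# (zone, x_min, x_max, y_min, y_max), copied from the zone table.
ZONES = [
    ("GA3", 114, 120, 30, 36),
    ("GA2", 114, 120, 36, 44),
    ("GA1", 114, 120, 44, 50),
    ("CA3", 108, 114, 30, 36),
    ("CA2", 108, 114, 36, 44),
    ("CA1", 108, 114, 44, 50),
    ("FZ", 102, 120, 50, 62),
    ("BZ", 102, 120, 18, 30),
    ("Edge", 100, 108, 30, 50),
]

def classify_zone(x, y):
    """Return the first zone containing (x, y), or 'Other' if none does.

    Assumed tie-breaking: zones are checked in table order, so a point on
    x = 114 goes to the 6-yard zones and a point on x = 108 to the CA zones.
    """
    for name, x0, x1, y0, y1 in ZONES:
        if x0 <= x <= x1 and y0 <= y < y1:
            return name
    return "Other"

for point in [(116, 33), (110, 40), (105, 40), (110, 55), (60, 40)]:
    print(point, classify_zone(*point))
```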

Delivery locations

Corner delivery locations: all delivery endpoints. Green = goal within 10 s, red = no goal. Successful deliveries concentrate near the back post of the six-yard box.

Swing type

In-box vs out-of-box

Delivery type   Count    Success rate
In box          27,215   3.19%
Out of box      6,831    1.73%

Chi-squared p = 1.53×10⁻¹⁰ — in-box deliveries significantly outperform out-of-box deliveries.
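As a rough check on the test above, the 2×2 chi-squared statistic can be recomputed from the table using only the standard library. Two assumptions are involved: the goal counts are reconstructed from the rounded success rates, and Yates' continuity correction is applied (the default for 2×2 tables in `scipy.stats.chi2_contingency`), which may or may not match the original analysis.

```python
import math

def chi2_2x2_yates(a, b, c, d):
    """Yates-corrected chi-squared statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * max(abs(a * d - b * c) - n / 2, 0.0) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = num / den
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

in_box_n, out_box_n = 27_215, 6_831
in_box_goals = round(in_box_n * 0.0319)     # reconstructed from the rounded rate
out_box_goals = round(out_box_n * 0.0173)   # reconstructed from the rounded rate

chi2, p_value = chi2_2x2_yates(
    in_box_goals, in_box_n - in_box_goals,
    out_box_goals, out_box_n - out_box_goals,
)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

The reconstructed counts also sum to an overall success rate of roughly 2.9%, in line with the baseline quoted for the project.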

Success rate by zone

Corner kick success rate by zone: GA3 (back post, six-yard box) leads at 6.2%, well ahead of every other zone and the only zone to test as statistically significant.
Short corners vs delivery zones: short corners (1.87%) underperform every named in-box zone and fall below the overall 2.9% baseline, making them the weakest of the strategies compared here.

Zone by swing type

Zone   Swing         Count   Success rate
GA3    inswinging    329     6.99%
GA3    outswinging   542     5.72%
CA3    inswinging    717     4.32%
CA2    outswinging   4,041   4.31%
Edge   inswinging    421     3.56%
GA2    inswinging    1,774   3.27%

GA3 inswinging vs all others: z = 4.451, p < 0.0001. The zone × swing type interaction model shows no significant interaction effects — GA3 leads under both swing types.
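The GA3 comparison is a pooled two-proportion z-test, which can be sketched as follows. The counts are assumptions reconstructed from rounded figures elsewhere on this page: 23 goals from 329 GA3 inswinging corners (6.99%), and "all others" taken as the remaining corners in the in-box and out-of-box totals (34,046 corners, 986 goals). The original analysis may have used a different comparison group.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z statistic and two-sided p-value."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

ga3_n = 329
ga3_goals = round(ga3_n * 0.0699)        # reconstructed from the rounded rate
total_n, total_goals = 34_046, 986       # reconstructed in-box + out-of-box totals

z, p = two_proportion_z(ga3_goals, ga3_n, total_goals - ga3_goals, total_n - ga3_n)
print(f"z = {z:.3f}, p = {p:.2e}")
```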

Success rate by zone and swing type: GA3 tops both halves at 7.0% inswinging and 5.7% outswinging. No other zone is consistently strong across swing types.

Logistic regression

Two logistic models were fitted: main effects (zone + swing type) and interactions (zone × swing type). In both, GA3 is the only zone term that is statistically significant. Neither the other zones nor swing type show a significant independent effect on success probability.

Term                      Coef     p-value
Intercept (BZ baseline)   −3.557   < 0.001
GA3                       +0.864   < 0.001
CA2                       +0.373   0.075
CA3                       +0.266   0.252
Outswinging               −0.038   0.579
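To read these coefficients as probabilities, the log-odds are summed and passed through the logistic function. The sketch below assumes the coefficients come from the main-effects model and that the reference category is a BZ delivery with the non-outswinging swing type; the function name `success_probability` is hypothetical.

```python
import math

# Coefficients copied from the table above (log-odds scale).
COEFS = {
    "Intercept": -3.557,   # BZ baseline
    "GA3": 0.864,
    "CA2": 0.373,
    "CA3": 0.266,
    "Outswinging": -0.038,
}

def success_probability(zone=None, outswinging=False):
    """Predicted success probability; zone=None means the BZ baseline."""
    logit = COEFS["Intercept"]
    if zone is not None:
        logit += COEFS[zone]
    if outswinging:
        logit += COEFS["Outswinging"]
    return 1 / (1 + math.exp(-logit))

p_baseline = success_probability()
p_ga3 = success_probability("GA3")
odds_ratio_ga3 = math.exp(COEFS["GA3"])
print(f"BZ baseline: {p_baseline:.3f}, GA3: {p_ga3:.3f}, GA3 odds ratio: {odds_ratio_ga3:.2f}")
```

The GA3 coefficient therefore corresponds to odds of success a little over twice those of the baseline zone, with the predicted GA3 probability close to the observed 6.2% rate.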

Optimal delivery coordinate

Degree-9 polynomial fits on binned success rates across end_x and end_y coordinates narrow the optimal delivery point to approximately x ≈ 113, y ≈ 33: right on the edge of the six-yard box (which starts at x = 114), back-post side.

Success rate vs end_x: polynomial fit peaks near x ≈ 113.
Success rate vs end_y: polynomial fit peaks near y ≈ 33, on the back-post side.

Recipient body part

Headers dominate successful outcomes, converting at 7.3–8.6% across both posts and swing types, with far-post headers slightly ahead. Non-header contacts (chest, foot) consistently underperform regardless of zone or swing type.

Post        Swing         Body part   Count    Rate
Far post    outswinging   Head        2,485    8.61%
Far post    inswinging    Head        1,378    8.35%
Near post   inswinging    Head        1,042    8.06%
Near post   outswinging   Head        1,897    7.33%
Near post   outswinging   Other       4,793    1.79%
Far post    outswinging   Other       12,146   1.48%

xG Dashboard

An XGBoost expected goals model and interactive dashboard covering the top 5 European leagues, built on ~257,000 shots from Understat (2020–2025). Three situation-specific models handle open play, corners, and set pieces separately, with isotonic calibration. Data refreshes daily via GitHub Actions.

257k shots
0.792 ROC-AUC
0.074 Brier score
5 leagues
3 specialist models

Models

Each shot is routed to a situation-specific XGBoost classifier based on how it was created. All three are wrapped in CalibratedClassifierCV(method="isotonic") and tuned independently via GridSearchCV (3-fold CV, Brier score). Penalties are fixed at 0.76 xG.
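The routing logic can be sketched without the models themselves. In the example below the situation labels are assumed to follow Understat-style strings, the mapping of direct free kicks to the set-piece model follows the table on this page, and the lambda "models" are hypothetical stand-ins for the calibrated XGBoost classifiers.

```python
PENALTY_XG = 0.76  # fixed value for penalties, as stated above

# Assumed situation label -> specialist model name.
SITUATION_TO_MODEL = {
    "OpenPlay": "OpenPlay",
    "FromCorner": "FromCorner",
    "SetPiece": "SetPiece",
    "DirectFreekick": "SetPiece",
}

def predict_xg(shot, models):
    """Route a shot to its specialist model; penalties bypass the models."""
    if shot["situation"] == "Penalty":
        return PENALTY_XG
    model = models[SITUATION_TO_MODEL[shot["situation"]]]
    return model(shot)

# Hypothetical stand-ins: each "model" is a callable returning a probability.
stub_models = {
    "OpenPlay": lambda shot: 0.10,
    "FromCorner": lambda shot: 0.07,
    "SetPiece": lambda shot: 0.05,
}

shots = [{"situation": s} for s in ["OpenPlay", "FromCorner", "DirectFreekick", "Penalty"]]
xg_values = [predict_xg(shot, stub_models) for shot in shots]
print(xg_values)
```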

Model        Situations                     Key features
OpenPlay     Open play, counter-attacks     Distance, angle, counter-attack proxy, throughball, rebound
FromCorner   Corner kicks                   Header interactions, centrality, weak-angle header
SetPiece     Direct & indirect free kicks   Distance, angle, shot type

Features

24 features across geometry (distance, angle, coordinates), shot type flags (header, foot, penalty), interaction terms, zone context, and proxy variables. Because Understat labels counter-attacks as open play, a fast_break proxy is engineered from the preceding action type to capture counter-attack context without direct tagging.
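The two core geometry features, distance and angle, might be derived as in the sketch below. It assumes shot coordinates normalised to the 0–1 range with the attacked goal at x = 1 and y = 0.5, and a 105 × 68 m pitch; the 7.32 m goal width is the standard dimension. The function name `shot_geometry` is hypothetical, and the actual feature code may differ.

```python
import math

PITCH_LENGTH_M = 105.0   # assumed pitch size
PITCH_WIDTH_M = 68.0     # assumed pitch size
GOAL_WIDTH_M = 7.32      # standard goal width

def shot_geometry(x_norm, y_norm):
    """Distance to the goal centre (m) and angle subtended by the goal mouth (degrees)."""
    dx = (1.0 - x_norm) * PITCH_LENGTH_M   # distance from the goal line
    dy = (y_norm - 0.5) * PITCH_WIDTH_M    # lateral offset from the goal centre
    distance = math.hypot(dx, dy)
    half = GOAL_WIDTH_M / 2
    # Difference between the bearings to the two posts.
    angle = math.atan2(dy + half, dx) - math.atan2(dy - half, dx)
    return distance, math.degrees(angle)

# Example: a shot from the penalty spot, 11 m out and central.
distance, angle = shot_geometry(1.0 - 11.0 / PITCH_LENGTH_M, 0.5)
print(f"distance = {distance:.1f} m, angle = {angle:.1f} degrees")
```

The angle shrinks both with distance and as the shot moves wide, which is why it carries information beyond distance alone.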

Performance

Metric        This model   Understat
ROC-AUC       0.792        0.805
Brier score   0.074        0.072

The small gap vs Understat is expected — commercial models incorporate freeze-frame data (exact defender positions at the moment of the shot) which Understat does not expose via their public API.
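For reference, both metrics in the table can be computed from scratch. The sketch below uses a made-up five-shot sample, not project data: the Brier score is the mean squared error of the predicted probabilities (lower is better), and ROC-AUC is the probability that a randomly chosen goal is ranked above a randomly chosen non-goal (higher is better).

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def roc_auc(y_true, y_prob):
    """Share of (goal, non-goal) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical sample: 1 = goal, 0 = no goal, with predicted xG values.
y_true = [0, 0, 1, 0, 1]
y_prob = [0.05, 0.10, 0.40, 0.30, 0.08]

brier = brier_score(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print(f"Brier = {brier:.4f}, ROC-AUC = {auc:.4f}")
```

The pairwise AUC loop is quadratic in the number of shots, so for a dataset of this size a library implementation such as scikit-learn's `roc_auc_score` would be the practical choice.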