require(rstanarm)
require(rstantools)
require(bayesplot)
Predicting Euro & Copa America - 2024
In this project, I will propose a model for predicting the Euro & Copa America - 2024 scores based on the model presented in (Florez, Guindani, and Vannucci 2024).
Packages
I will use a Bayesian model, so we need the following libraries
As well as
require(readr)
require(dplyr)
require(lubridate)
require(ggplot2)
require(ggrepel)
require(latex2exp)
Data
The data is contained in this repository https://github.com/martj42/international_results.
= "https://raw.githubusercontent.com/martj42/international_results/master/results.csv" urlfile
We load the data
<- read_csv(url(urlfile), show_col_types = FALSE) data
The proposed model is based on the teams’ strengths attacking and defending. To train the model, we used the scores of games played since season 2023-2024.
= c('Albania','Austria','Belgium','Croatia','Czech Republic','Denmark','England','France','Georgia','Germany','Hungary','Italy','Netherlands','Poland','Portugal','Romania','Scotland','Serbia','Slovakia','Slovenia','Spain','Switzerland','Turkey','Ukraine')
euro_2024
= sort(c('Argentina', 'Chile', 'Peru', 'Canada', 'Mexico', 'Ecuador', 'Venezuela', 'Jamaica', 'United States', 'Uruguay', 'Panama', 'Bolivia', 'Brazil', 'Colombia', 'Paraguay', 'Costa Rica'))
america_2024
<- data %>%
scores filter(home_team %in% euro_2024 & away_team %in% euro_2024) %>%
filter(date > "2023-06-01") %>%
mutate(id = row_number(), cup = "Euro") %>%
bind_rows(data %>%
filter(home_team %in% america_2024 & away_team %in% america_2024) %>%
filter(date > "2023-06-01") %>%
mutate(id = row_number(), cup = "America")
)
For example, we consider the following games played by Portugal.
%>%
scores filter(home_team == "Portugal" | away_team == "Portugal")
# A tibble: 11 × 11
date home_team away_team home_score away_score tournament city country
<date> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 2023-09-08 Slovakia Portugal 0 1 UEFA Euro… Brat… Slovak…
2 2023-10-13 Portugal Slovakia 3 2 UEFA Euro… Porto Portug…
3 2024-03-26 Slovenia Portugal 2 0 Friendly Ljub… Sloven…
4 2024-06-08 Portugal Croatia 1 2 Friendly Oeir… Portug…
5 2024-06-18 Portugal Czech Re… 2 1 UEFA Euro Leip… Germany
6 2024-06-22 Turkey Portugal 0 3 UEFA Euro Dort… Germany
7 2024-06-26 Georgia Portugal 2 0 UEFA Euro Gels… Germany
8 2024-07-01 Portugal Slovenia 0 0 UEFA Euro Fran… Germany
9 2024-07-05 Portugal France 0 0 UEFA Euro Hamb… Germany
10 2024-09-05 Portugal Croatia 2 1 UEFA Nati… Lisb… Portug…
11 2024-09-08 Portugal Scotland 2 1 UEFA Nati… Lisb… Portug…
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>
and the following games played by Colombia
%>%
scores filter(home_team == "Colombia" | away_team == "Colombia")
# A tibble: 17 × 11
date home_team away_team home_score away_score tournament city country
<date> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 2023-09-07 Colombia Venezuela 1 0 FIFA Worl… Barr… Colomb…
2 2023-09-12 Chile Colombia 0 0 FIFA Worl… Sant… Chile
3 2023-10-12 Colombia Uruguay 2 2 FIFA Worl… Barr… Colomb…
4 2023-10-17 Ecuador Colombia 0 0 FIFA Worl… Quito Ecuador
5 2023-11-16 Colombia Brazil 2 1 FIFA Worl… Barr… Colomb…
6 2023-11-21 Paraguay Colombia 0 1 FIFA Worl… Asun… Paragu…
7 2023-12-16 Mexico Colombia 2 3 Friendly Los … United…
8 2024-06-08 United S… Colombia 1 5 Friendly Land… United…
9 2024-06-15 Colombia Bolivia 3 0 Friendly Denv… United…
10 2024-06-24 Colombia Paraguay 2 1 Copa Amér… Hous… United…
11 2024-06-28 Colombia Costa Ri… 3 0 Copa Amér… Glen… United…
12 2024-07-02 Brazil Colombia 1 1 Copa Amér… Sant… United…
13 2024-07-06 Colombia Panama 5 0 Copa Amér… Glen… United…
14 2024-07-10 Uruguay Colombia 0 1 Copa Amér… Char… United…
15 2024-07-14 Argentina Colombia 1 0 Copa Amér… Miam… United…
16 2024-09-06 Peru Colombia 1 1 FIFA Worl… Lima Peru
17 2024-09-10 Colombia Argentina 2 1 FIFA Worl… Barr… Colomb…
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>
Filtering data
<- scores %>%
home_scores select(id, score = home_score, team = home_team, adversary = away_team, date = date, cup) %>%
mutate(condition = ifelse(scores$neutral, "neutral", "home"))
<- scores %>%
away_scores select(id, score = away_score, team = away_team, adversary = home_team, date = date, cup) %>%
mutate(condition = ifelse(scores$neutral, "neutral", "away"))
<- union_all(home_scores, away_scores) total_scores
We also consider the ranking FIFA of the teams as covariate. For every game, select the up-to-date ranking
<- "https://raw.githubusercontent.com/Dato-Futbol/fifa-ranking/master/ranking_fifa_historical.csv"
path0 <- read_csv(url(path0), show_col_types = FALSE)
rank
<- rank %>%
rank select(team, total_points, date) %>%
group_by(date) %>%
mutate(ranking = rank(-total_points))
Merging the data will be
<- total_scores %>%
scores_rank left_join(rank, by = join_by(adversary == team), relationship = "many-to-many") %>%
filter(date.x >= date.y) %>%
mutate(diff_date = date.x - date.y) %>%
group_by(id, team) %>%
slice_min(diff_date) %>%
select(id, score, team, adversary, condition, date_game = date.x, ranking_adv = ranking, date_rank = date.y, cup)
We are separating the data for both Euro and Copa America. For both cases, we will use as test data the games played after the quarter-finals. i.e, we will finish predicting the results of the semi-finals and the finals.
= "2024-07-09"
date_euro = "2024-07-09"
date_ca
<- scores_rank %>%
train_euro filter(date_game < date_euro, cup == "Euro")
<- scores_rank %>%
train_america filter(date_game < date_ca, cup == "America")
<- scores_rank %>%
test_euro filter(date_game >= date_euro, cup == "Euro")
<- scores_rank %>%
test_america filter(date_game >= date_ca, cup == "America")
Fitting the model
To fit the model I would consider the scores are distributed as Poisson for simplicity, one better option is the Conway-Maxwell-Poisson as shown in (Florez, Guindani, and Vannucci 2024). However, we will use a simple model for computational reasons as we constantly update the model.
<- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_euro, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum")) model_euro
<- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_america, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum")) model_america
Estimation
Euro
<- model_euro$coefficients[2:24]
strength_att <- c(strength_att, 0 - sum(strength_att))
strength_att
<- model_euro$coefficients[25:47]
strength_def <- c(strength_def, 0 - sum(strength_def))
strength_def
<- data.frame(team = euro_2024, strength_att, strength_def, row.names = c()) coeffs
Now, we can plot the estimated strengths in the following way:
ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
geom_point() +
geom_label_repel(aes(label = team),
box.padding = 0.25,
point.padding = 0.5,
segment.color = "grey50") +
xlab(expression("Better Attack" %->% "")) +
ylab(expression("Better Defense" %->% "")) +
ggtitle("Strengths of every team") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
This graph provides an intuition into the strengths and abilities of every team. Teams located in the origin have an overall performance, and below the axis are worse than the averages’ team.
Copa America
Similarly for Copa america
<- length(america_2024)
n <- model_america$coefficients[2:n]
strength_att <- c(strength_att, 0 - sum(strength_att))
strength_att
<- model_america$coefficients[(n+1):(2*n-1)]
strength_def <- c(strength_def, 0 - sum(strength_def))
strength_def
<- data.frame(team = america_2024, strength_att, strength_def, row.names = c()) coeffs
Now, we can plot the estimated strengths in the following way:
ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
geom_point() +
geom_label_repel(aes(label = team),
box.padding = 0.25,
point.padding = 0.5,
segment.color = "grey50") +
xlab(expression("Better Attack" %->% "")) +
ylab(expression("Better Defense" %->% "")) +
ggtitle("Strengths of every team") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
Predictions
Euro
Now, we predict the scores of the future games, i.e, the semifinals and the final.
<- test_euro %>%
test_pred select(id, team, adversary, condition, ranking_adv)
<- posterior_predict(model_euro, newdata = test_pred)
predictions
<- apply(predictions, 2, mean)
mean_goals
<- test_pred %>%
test_pred ungroup() %>%
mutate(goals = mean_goals) %>%
group_by(id) %>%
summarise("Team 1" = first(team),
"Exp. Goals" = first(goals),
"Team 2" = last(team),
"Exp. Goals 2" = last(goals))
The predictions of future games are:
%>%
test_pred print(n = 1e3)
# A tibble: 20 × 5
id `Team 1` `Exp. Goals` `Team 2` `Exp. Goals 2`
<int> <chr> <dbl> <chr> <dbl>
1 103 France 0.395 Spain 0.840
2 104 England 1.88 Netherlands 1.62
3 105 England 0.632 Spain 1.56
4 106 Croatia 1.55 Portugal 1.29
5 107 Poland 2.18 Scotland 0.914
6 108 Denmark 0.744 Switzerland 0.922
7 109 Serbia 0.349 Spain 1.88
8 110 France 1.26 Italy 0.651
9 111 Austria 1.90 Slovenia 1.33
10 112 Germany 2.18 Hungary 0.889
11 113 Czech Republic 1.82 Georgia 1.85
12 114 Albania 1.40 Ukraine 0.798
13 115 Portugal 2.25 Scotland 0.762
14 116 Croatia 1.95 Poland 1.26
15 117 Denmark 1.06 Serbia 0.494
16 118 Spain 1.26 Switzerland 0.654
17 119 Belgium 0.683 France 0.831
18 120 Germany 2.03 Netherlands 1.99
19 121 Albania 4.52 Georgia 1.20
20 122 Czech Republic 1.08 Ukraine 1.38
Particullarly, the distribution of the prediction of the score for the final looks as shown below.
<- tibble("England" = predictions[,5], "Spain" = predictions[,6])
results_game
<- results_game %>%
df_count group_by(England, Spain) %>%
summarise(count = n(), .groups = "keep") %>%
ungroup()
<- nrow(results_game)
total_combinations
<- df_count %>%
df_count mutate(proportion = count/total_combinations)
ggplot(df_count, aes(x = England, y = Spain, fill = proportion)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Heatmap of Goals Scored by the two teams",
x = "England",
y = "Spain",
fill = "Probability") +
theme_minimal()
According to this, it is likely that the final will be Spain vs. England. Spain is expected to be the champion, with a predicted score of 2-0.
Copa America
Similarly, we can generate the predictions for the last round of the games in Copa América
<- test_america %>%
test_pred select(id, team, adversary, condition, ranking_adv)
<- posterior_predict(model_america, newdata = test_pred)
predictions
<- apply(predictions, 2, mean)
mean_goals
<- test_pred %>%
test_pred ungroup() %>%
mutate(goals = mean_goals) %>%
group_by(id) %>%
summarise("Team 1" = first(team),
"Exp. Goals" = first(goals),
"Team 2" = last(team),
"Exp. Goals 2" = last(goals))
%>%
test_pred print(n = 1e3)
# A tibble: 16 × 5
id `Team 1` `Exp. Goals` `Team 2` `Exp. Goals 2`
<int> <chr> <dbl> <chr> <dbl>
1 96 Argentina 1.90 Canada 0.432
2 97 Colombia 0.867 Uruguay 1.32
3 98 Canada 0.495 Uruguay 2.57
4 99 Argentina 1.02 Colombia 0.747
5 100 Bolivia 0.547 Venezuela 1.63
6 101 Argentina 1.09 Chile 0.173
7 102 Colombia 1.26 Peru 0.108
8 103 Paraguay 0.168 Uruguay 2.33
9 104 Brazil 1.61 Ecuador 0.721
10 105 United States 2.08 United States 2.08
11 106 Canada 1.05 Mexico 1.82
12 107 Argentina 0.897 Colombia 0.922
13 108 Ecuador 1.24 Peru 0.0756
14 109 Bolivia 0.260 Chile 1.14
15 110 Uruguay 1.32 Venezuela 0.755
16 111 Brazil 1.19 Paraguay 0.412
In this case, Argentina is expected to easily defeat Canada, and Uruguay will likely beat Colombia. However, Colombia won its game 1-0, so the final will be Argentina vs. Colombia, while the 3rd and 4th place game will be played between Canada and Uruguay.
The final is anticipated to be a close match, with the most likely result being 0-0, as indicated below. However, Argentina is expected to have a slightly higher probability of scoring more goals than Colombia.
<- tibble("Argentina" = predictions[,7], "Colombia" = predictions[,8])
results_game
<- results_game %>%
df_count group_by(Argentina, Colombia) %>%
summarise(count = n(), .groups = "keep") %>%
ungroup()
<- nrow(results_game)
total_combinations
<- df_count %>%
df_count mutate(proportion = count/total_combinations)
ggplot(df_count, aes(x = Argentina, y = Colombia, fill = proportion)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Heatmap of Goals Scored by the two teams",
x = "Argentina",
y = "Colombia",
fill = "Probability") +
theme_minimal()
As we saw, the model accurately predicted the scores and captured the strengths of each team. It also produced reasonable and accurate results for the final games based on the quarter-final outcomes. Spain defeated England 2-1, while the match between Argentina and Colombia ended 0-0, with Argentina scoring a goal in extra time.
For future models, incorporating dynamic strengths could improve accuracy.