require(rstanarm)
require(rstantools)
require(bayesplot)
Predicting Euro & Copa America - 2024
In this project, I will propose a model for predicting the Euro & Copa America - 2024 scores based on the model presented in (Florez, Guindani, and Vannucci 2024).
Packages
I will use a Bayesian model, so we need the following libraries
As well as
require(readr)
require(dplyr)
require(lubridate)
require(ggplot2)
require(ggrepel)
require(latex2exp)
Data
The data is contained in this repository https://github.com/martj42/international_results.
= "https://raw.githubusercontent.com/martj42/international_results/master/results.csv" urlfile
We load the data
<- read_csv(url(urlfile), show_col_types = FALSE) data
The proposed model is based on the teams’ strengths attacking and defending. To train the model, we used the scores of games played since season 2023-2024.
= c('Albania','Austria','Belgium','Croatia','Czech Republic','Denmark','England','France','Georgia','Germany','Hungary','Italy','Netherlands','Poland','Portugal','Romania','Scotland','Serbia','Slovakia','Slovenia','Spain','Switzerland','Turkey','Ukraine')
euro_2024
= sort(c('Argentina', 'Chile', 'Peru', 'Canada', 'Mexico', 'Ecuador', 'Venezuela', 'Jamaica', 'United States', 'Uruguay', 'Panama', 'Bolivia', 'Brazil', 'Colombia', 'Paraguay', 'Costa Rica'))
america_2024
<- data %>%
scores filter(home_team %in% euro_2024 & away_team %in% euro_2024) %>%
filter(date > "2023-06-01") %>%
mutate(id = row_number(), cup = "Euro") %>%
bind_rows(data %>%
filter(home_team %in% america_2024 & away_team %in% america_2024) %>%
filter(date > "2023-06-01") %>%
mutate(id = row_number(), cup = "America")
)
For example, we consider the following games played by Portugal.
%>%
scores filter(home_team == "Portugal" | away_team == "Portugal")
# A tibble: 19 × 11
date home_team away_team home_score away_score tournament city country
<date> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 2023-09-08 Slovakia Portugal 0 1 UEFA Euro… Brat… Slovak…
2 2023-10-13 Portugal Slovakia 3 2 UEFA Euro… Porto Portug…
3 2024-03-26 Slovenia Portugal 2 0 Friendly Ljub… Sloven…
4 2024-06-08 Portugal Croatia 1 2 Friendly Oeir… Portug…
5 2024-06-18 Portugal Czech Re… 2 1 UEFA Euro Leip… Germany
6 2024-06-22 Turkey Portugal 0 3 UEFA Euro Dort… Germany
7 2024-06-26 Georgia Portugal 2 0 UEFA Euro Gels… Germany
8 2024-07-01 Portugal Slovenia 0 0 UEFA Euro Fran… Germany
9 2024-07-05 Portugal France 0 0 UEFA Euro Hamb… Germany
10 2024-09-05 Portugal Croatia 2 1 UEFA Nati… Lisb… Portug…
11 2024-09-08 Portugal Scotland 2 1 UEFA Nati… Lisb… Portug…
12 2024-10-12 Poland Portugal 1 3 UEFA Nati… Wars… Poland
13 2024-10-15 Scotland Portugal 0 0 UEFA Nati… Glas… Scotla…
14 2024-11-15 Portugal Poland 5 1 UEFA Nati… Porto Portug…
15 2024-11-18 Croatia Portugal 1 1 UEFA Nati… Split Croatia
16 2025-03-20 Denmark Portugal 1 0 UEFA Nati… Cope… Denmark
17 2025-03-23 Portugal Denmark 5 2 UEFA Nati… Lisb… Portug…
18 2025-06-04 Germany Portugal 1 2 UEFA Nati… Muni… Germany
19 2025-06-08 Portugal Spain 2 2 UEFA Nati… Muni… Germany
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>
and the following games played by Colombia
%>%
scores filter(home_team == "Colombia" | away_team == "Colombia")
# A tibble: 25 × 11
date home_team away_team home_score away_score tournament city country
<date> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 2023-09-07 Colombia Venezuela 1 0 FIFA Worl… Barr… Colomb…
2 2023-09-12 Chile Colombia 0 0 FIFA Worl… Sant… Chile
3 2023-10-12 Colombia Uruguay 2 2 FIFA Worl… Barr… Colomb…
4 2023-10-17 Ecuador Colombia 0 0 FIFA Worl… Quito Ecuador
5 2023-11-16 Colombia Brazil 2 1 FIFA Worl… Barr… Colomb…
6 2023-11-21 Paraguay Colombia 0 1 FIFA Worl… Asun… Paragu…
7 2023-12-16 Mexico Colombia 2 3 Friendly Los … United…
8 2024-06-08 United S… Colombia 1 5 Friendly Land… United…
9 2024-06-15 Colombia Bolivia 3 0 Friendly Denv… United…
10 2024-06-24 Colombia Paraguay 2 1 Copa Amér… Hous… United…
# ℹ 15 more rows
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>
Filtering data
<- scores %>%
home_scores select(id, score = home_score, team = home_team, adversary = away_team, date = date, cup) %>%
mutate(condition = ifelse(scores$neutral, "neutral", "home"))
<- scores %>%
away_scores select(id, score = away_score, team = away_team, adversary = home_team, date = date, cup) %>%
mutate(condition = ifelse(scores$neutral, "neutral", "away"))
<- union_all(home_scores, away_scores) total_scores
We also consider the ranking FIFA of the teams as covariate. For every game, select the up-to-date ranking
<- "https://raw.githubusercontent.com/Dato-Futbol/fifa-ranking/master/ranking_fifa_historical.csv"
path0 <- read_csv(url(path0), show_col_types = FALSE)
rank
<- rank %>%
rank select(team, total_points, date) %>%
group_by(date) %>%
mutate(ranking = rank(-total_points))
Merging the data will be
<- total_scores %>%
scores_rank left_join(rank, by = join_by(adversary == team), relationship = "many-to-many") %>%
filter(date.x >= date.y) %>%
mutate(diff_date = date.x - date.y) %>%
group_by(id, team) %>%
slice_min(diff_date) %>%
select(id, score, team, adversary, condition, date_game = date.x, ranking_adv = ranking, date_rank = date.y, cup)
We are separating the data for both Euro and Copa America. For both cases, we will use as test data the games played after the quarter-finals. i.e, we will finish predicting the results of the semi-finals and the finals.
= "2024-07-09"
date_euro = "2024-07-09"
date_ca
<- scores_rank %>%
train_euro filter(date_game < date_euro, cup == "Euro")
<- scores_rank %>%
train_america filter(date_game < date_ca, cup == "America")
<- scores_rank %>%
test_euro filter(date_game >= date_euro, cup == "Euro")
<- scores_rank %>%
test_america filter(date_game >= date_ca, cup == "America")
Fitting the model
To fit the model I would consider the scores are distributed as Poisson for simplicity, one better option is the Conway-Maxwell-Poisson as shown in (Florez, Guindani, and Vannucci 2024). However, we will use a simple model for computational reasons as we constantly update the model.
<- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_euro, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum")) model_euro
<- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_america, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum")) model_america
Estimation
Euro
<- model_euro$coefficients[2:24]
strength_att <- c(strength_att, 0 - sum(strength_att))
strength_att
<- model_euro$coefficients[25:47]
strength_def <- c(strength_def, 0 - sum(strength_def))
strength_def
<- data.frame(team = euro_2024, strength_att, strength_def, row.names = c()) coeffs
Now, we can plot the estimated strengths in the following way:
ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
geom_point() +
geom_label_repel(aes(label = team),
box.padding = 0.25,
point.padding = 0.5,
segment.color = "grey50") +
xlab(expression("Better Attack" %->% "")) +
ylab(expression("Better Defense" %->% "")) +
ggtitle("Strengths of every team") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
This graph provides an intuition into the strengths and abilities of every team. Teams located in the origin have an overall performance, and below the axis are worse than the averages’ team.
Copa America
Similarly for Copa america
<- length(america_2024)
n <- model_america$coefficients[2:n]
strength_att <- c(strength_att, 0 - sum(strength_att))
strength_att
<- model_america$coefficients[(n+1):(2*n-1)]
strength_def <- c(strength_def, 0 - sum(strength_def))
strength_def
<- data.frame(team = america_2024, strength_att, strength_def, row.names = c()) coeffs
Now, we can plot the estimated strengths in the following way:
ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
geom_point() +
geom_label_repel(aes(label = team),
box.padding = 0.25,
point.padding = 0.5,
segment.color = "grey50") +
xlab(expression("Better Attack" %->% "")) +
ylab(expression("Better Defense" %->% "")) +
ggtitle("Strengths of every team") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
Predictions
Euro
Now, we predict the scores of the future games, i.e, the semifinals and the final.
<- test_euro %>%
test_pred select(id, team, adversary, condition, ranking_adv)
<- posterior_predict(model_euro, newdata = test_pred)
predictions
<- apply(predictions, 2, mean)
mean_goals
<- test_pred %>%
test_pred ungroup() %>%
mutate(goals = mean_goals) %>%
group_by(id) %>%
summarise("Team 1" = first(team),
"Exp. Goals" = first(goals),
"Team 2" = last(team),
"Exp. Goals 2" = last(goals))
The predictions of future games are:
%>%
test_pred print(n = 1e3)
# A tibble: 77 × 5
id `Team 1` `Exp. Goals` `Team 2` `Exp. Goals 2`
<int> <chr> <dbl> <chr> <dbl>
1 103 France 0.402 Spain 0.842
2 104 England 1.89 Netherlands 1.60
3 105 England 0.626 Spain 1.53
4 106 Croatia 1.53 Portugal 1.30
5 107 Poland 2.16 Scotland 0.940
6 108 Denmark 0.748 Switzerland 0.941
7 109 Serbia 0.359 Spain 1.88
8 110 France 1.29 Italy 0.644
9 111 Austria 1.86 Slovenia 1.32
10 112 Germany 2.17 Hungary 0.889
11 113 Czech Republic 1.81 Georgia 1.88
12 114 Albania 1.38 Ukraine 0.789
13 115 Portugal 2.26 Scotland 0.754
14 116 Croatia 1.95 Poland 1.24
15 117 Denmark 1.08 Serbia 0.487
16 118 Spain 1.30 Switzerland 0.669
17 119 Belgium 0.669 France 0.835
18 120 Germany 2.02 Netherlands 2.01
19 121 Albania 4.40 Georgia 1.18
20 122 Czech Republic 1.09 Ukraine 1.38
21 123 Belgium 1.12 Italy 0.858
22 124 Hungary 1.74 Netherlands 2.43
23 125 Albania 1.92 Czech Republic 0.864
24 126 Georgia 1.09 Ukraine 1.97
25 127 Croatia 3.50 Scotland 0.75
26 128 Poland 1.46 Portugal 1.12
27 129 Serbia 0.532 Switzerland 1.20
28 130 Denmark 0.433 Spain 1.75
29 131 Belgium 0.808 France 0.715
30 132 Germany 2.40 Netherlands 1.67
31 133 Albania 3.84 Georgia 1.40
32 134 Czech Republic 0.709 Ukraine 1.07
33 135 Portugal 2.10 Scotland 0.891
34 136 Croatia 1.61 Poland 1.46
35 137 Serbia 0.307 Spain 2.24
36 138 Denmark 0.619 Switzerland 1.08
37 139 Belgium 1.34 Italy 0.712
38 140 Poland 1.24 Portugal 1.36
39 141 Croatia 2.95 Scotland 0.921
40 142 Denmark 0.523 Spain 1.47
41 143 Serbia 0.446 Switzerland 1.42
42 144 Hungary 1.42 Netherlands 2.90
43 145 Albania 2.27 Czech Republic 0.713
44 146 Georgia 1.70 Ukraine 2.59
45 147 France 1.08 Italy 0.793
46 148 Austria 2.21 Slovenia 1.12
47 149 Croatia 1.84 Portugal 1.11
48 150 Poland 2.70 Scotland 0.791
49 151 Denmark 0.933 Serbia 0.590
50 152 Spain 1.53 Switzerland 0.555
51 153 Germany 1.84 Hungary 1.07
52 154 Albania 2.21 Ukraine 1.04
53 155 Czech Republic 2.26 Georgia 1.54
54 156 Croatia 1.06 France 1.18
55 157 Denmark 1.17 Portugal 0.845
56 158 Germany 1.46 Italy 1.18
57 159 Netherlands 1.07 Spain 2.56
58 160 Austria 2.66 Serbia 0.661
59 161 Hungary 1.24 Turkey 1.80
60 162 Belgium 1.04 Ukraine 0.511
61 163 Slovakia 0.914 Slovenia 1.38
62 164 Albania 1.63 England 2.12
63 165 Croatia 0.888 France 1.41
64 166 Germany 1.71 Italy 0.989
65 167 Denmark 1.00 Portugal 0.986
66 168 Netherlands 0.903 Spain 3.08
67 169 Belgium 1.62 Ukraine 0.641
68 170 Hungary 1.49 Turkey 1.48
69 171 Austria 2.26 Serbia 0.787
70 172 Slovakia 0.753 Slovenia 1.65
71 173 Germany 1.85 Portugal 0.906
72 174 France 0.401 Spain 0.838
73 175 Austria 3.41 Romania 1.53
74 176 Albania 1.86 Serbia 0.633
75 177 France 0.934 Germany 1.04
76 178 Portugal 0.360 Spain 1.51
77 179 Croatia 2.38 Czech Republic 0.825
Particullarly, the distribution of the prediction of the score for the final looks as shown below.
<- tibble("England" = predictions[,5], "Spain" = predictions[,6])
results_game
<- results_game %>%
df_count group_by(England, Spain) %>%
summarise(count = n(), .groups = "keep") %>%
ungroup()
<- nrow(results_game)
total_combinations
<- df_count %>%
df_count mutate(proportion = count/total_combinations)
ggplot(df_count, aes(x = England, y = Spain, fill = proportion)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Heatmap of Goals Scored by the two teams",
x = "England",
y = "Spain",
fill = "Probability") +
theme_minimal()
According to this, it is likely that the final will be Spain vs. England. Spain is expected to be the champion, with a predicted score of 2-0.
Copa America
Similarly, we can generate the predictions for the last round of the games in Copa América
<- test_america %>%
test_pred select(id, team, adversary, condition, ranking_adv)
<- posterior_predict(model_america, newdata = test_pred)
predictions
<- apply(predictions, 2, mean)
mean_goals
<- test_pred %>%
test_pred ungroup() %>%
mutate(goals = mean_goals) %>%
group_by(id) %>%
summarise("Team 1" = first(team),
"Exp. Goals" = first(goals),
"Team 2" = last(team),
"Exp. Goals 2" = last(goals))
%>%
test_pred print(n = 1e3)
# A tibble: 70 × 5
id `Team 1` `Exp. Goals` `Team 2` `Exp. Goals 2`
<int> <chr> <dbl> <chr> <dbl>
1 96 Argentina 1.92 Canada 0.416
2 97 Colombia 0.874 Uruguay 1.34
3 98 Canada 0.498 Uruguay 2.57
4 99 Argentina 1.03 Colombia 0.766
5 100 Bolivia 0.554 Venezuela 1.62
6 101 Argentina 1.09 Chile 0.171
7 102 Colombia 1.25 Peru 0.109
8 103 Paraguay 0.171 Uruguay 2.29
9 104 Brazil 1.58 Ecuador 0.728
10 105 United States 2.06 United States 2.06
11 106 Canada 1.01 Mexico 1.80
12 107 Argentina 0.888 Colombia 0.932
13 108 Ecuador 1.26 Peru 0.0765
14 109 Bolivia 0.258 Chile 1.11
15 110 Uruguay 1.33 Venezuela 0.754
16 111 Brazil 1.17 Paraguay 0.405
17 112 Bolivia 0.481 Colombia 2.89
18 113 Ecuador 1.57 Paraguay 0.281
19 114 Argentina 0.918 Venezuela 0.608
20 115 Brazil 0.649 Chile 0.543
21 116 Peru 0.0716 Uruguay 1.30
22 117 United States 2.72 United States 2.72
23 118 Canada 2.79 Panama 1.06
24 119 United States 0.826 United States 0.826
25 120 Chile 0.326 Colombia 1.19
26 121 Paraguay 0.396 Venezuela 1.08
27 122 Ecuador 0.450 Uruguay 1.98
28 123 Argentina 3.58 Bolivia 0.164
29 124 Brazil 1.61 Peru 0.0758
30 125 Costa Rica 1.60 Panama 1.00
31 126 United States 1.34 United States 1.34
32 127 Brazil 0.929 Venezuela 1.22
33 128 Bolivia 0.339 Ecuador 2.74
34 129 Argentina 1.28 Paraguay 0.210
35 130 Chile 0.364 Peru 0.0775
36 131 Colombia 0.746 Uruguay 1.78
37 132 Costa Rica 1.02 Panama 1.54
38 133 United States 2.01 United States 2.01
39 134 Bolivia 0.739 Paraguay 0.657
40 135 Colombia 1.89 Ecuador 0.671
41 136 Argentina 1.55 Peru 0.0361
42 137 Chile 0.513 Venezuela 0.507
43 138 Brazil 0.939 Uruguay 1.22
44 139 United States 0.996 United States 0.996
45 140 United States 2.12 United States 2.12
46 141 Chile 1.44 Panama 0.325
47 142 Canada 1.05 Mexico 1.94
48 143 United States 2.71 United States 2.71
49 144 Chile 0.472 Paraguay 0.262
50 145 Bolivia 0.365 Peru 0.273
51 146 Brazil 1.41 Colombia 1.16
52 147 Ecuador 1.04 Venezuela 0.782
53 148 Argentina 0.592 Uruguay 0.961
54 149 United States 2.22 United States 2.22
55 150 Mexico 2.34 Panama 0.774
56 151 Bolivia 0.339 Uruguay 3.01
57 152 Colombia 2.51 Paraguay 0.247
58 153 Peru 0.0707 Venezuela 1.29
59 154 Argentina 1.44 Brazil 0.497
60 155 Chile 0.561 Ecuador 0.488
61 156 Brazil 1.00 Ecuador 1.15
62 157 Paraguay 0.262 Uruguay 1.74
63 158 Argentina 0.619 Chile 0.274
64 159 Colombia 1.97 Peru 0.0684
65 160 Bolivia 0.336 Venezuela 2.99
66 161 Bolivia 0.340 Chile 0.848
67 162 Uruguay 1.82 Venezuela 0.474
68 163 Ecuador 0.761 Peru 0.121
69 164 Argentina 1.37 Colombia 0.584
70 165 Brazil 2.10 Paraguay 0.256
In this case, Argentina is expected to easily defeat Canada, and Uruguay will likely beat Colombia. However, Colombia won its game 1-0, so the final will be Argentina vs. Colombia, while the 3rd and 4th place game will be played between Canada and Uruguay.
The final is anticipated to be a close match, with the most likely result being 0-0, as indicated below. However, Argentina is expected to have a slightly higher probability of scoring more goals than Colombia.
<- tibble("Argentina" = predictions[,7], "Colombia" = predictions[,8])
results_game
<- results_game %>%
df_count group_by(Argentina, Colombia) %>%
summarise(count = n(), .groups = "keep") %>%
ungroup()
<- nrow(results_game)
total_combinations
<- df_count %>%
df_count mutate(proportion = count/total_combinations)
ggplot(df_count, aes(x = Argentina, y = Colombia, fill = proportion)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Heatmap of Goals Scored by the two teams",
x = "Argentina",
y = "Colombia",
fill = "Probability") +
theme_minimal()
As we saw, the model accurately predicted the scores and captured the strengths of each team. It also produced reasonable and accurate results for the final games based on the quarter-final outcomes. Spain defeated England 2-1, while the match between Argentina and Colombia ended 0-0, with Argentina scoring a goal in extra time.
For future models, incorporating dynamic strengths could improve accuracy.