require(rstanarm)
require(rstantools)
require(bayesplot)
Predicting Euro & Copa America - 2024
In this project, I will propose a model for predicting the Euro & Copa America - 2024 scores based on the model presented in (Florez, Guindani, and Vannucci 2024).
Packages
I will use a Bayesian model, so we need the following libraries
As well as
require(readr)
require(dplyr)
require(lubridate)
require(ggplot2)
require(ggrepel)
require(latex2exp)
Data
The data is contained in this repository https://github.com/martj42/international_results.
= "https://raw.githubusercontent.com/martj42/international_results/master/results.csv" urlfile
We load the data
<- read_csv(url(urlfile), show_col_types = FALSE) data
The proposed model is based on the teams’ strengths attacking and defending. To train the model, we used the scores of games played since season 2023-2024.
= c('Albania','Austria','Belgium','Croatia','Czech Republic','Denmark','England','France','Georgia','Germany','Hungary','Italy','Netherlands','Poland','Portugal','Romania','Scotland','Serbia','Slovakia','Slovenia','Spain','Switzerland','Turkey','Ukraine')
euro_2024
= sort(c('Argentina', 'Chile', 'Peru', 'Canada', 'Mexico', 'Ecuador', 'Venezuela', 'Jamaica', 'United States', 'Uruguay', 'Panama', 'Bolivia', 'Brazil', 'Colombia', 'Paraguay', 'Costa Rica'))
america_2024
<- data %>%
scores filter(home_team %in% euro_2024 & away_team %in% euro_2024) %>%
filter(date > "2023-06-01") %>%
mutate(id = row_number(), cup = "Euro") %>%
bind_rows(data %>%
filter(home_team %in% america_2024 & away_team %in% america_2024) %>%
filter(date > "2023-06-01") %>%
mutate(id = row_number(), cup = "America")
)
For example, we consider the following games played by Portugal.
%>%
scores filter(home_team == "Portugal" | away_team == "Portugal")
# A tibble: 19 × 11
date home_team away_team home_score away_score tournament city country
<date> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 2023-09-08 Slovakia Portugal 0 1 UEFA Euro… Brat… Slovak…
2 2023-10-13 Portugal Slovakia 3 2 UEFA Euro… Porto Portug…
3 2024-03-26 Slovenia Portugal 2 0 Friendly Ljub… Sloven…
4 2024-06-08 Portugal Croatia 1 2 Friendly Oeir… Portug…
5 2024-06-18 Portugal Czech Re… 2 1 UEFA Euro Leip… Germany
6 2024-06-22 Turkey Portugal 0 3 UEFA Euro Dort… Germany
7 2024-06-26 Georgia Portugal 2 0 UEFA Euro Gels… Germany
8 2024-07-01 Portugal Slovenia 0 0 UEFA Euro Fran… Germany
9 2024-07-05 Portugal France 0 0 UEFA Euro Hamb… Germany
10 2024-09-05 Portugal Croatia 2 1 UEFA Nati… Lisb… Portug…
11 2024-09-08 Portugal Scotland 2 1 UEFA Nati… Lisb… Portug…
12 2024-10-12 Poland Portugal 1 3 UEFA Nati… Wars… Poland
13 2024-10-15 Scotland Portugal 0 0 UEFA Nati… Glas… Scotla…
14 2024-11-15 Portugal Poland 5 1 UEFA Nati… Porto Portug…
15 2024-11-18 Croatia Portugal 1 1 UEFA Nati… Split Croatia
16 2025-03-20 Denmark Portugal 1 0 UEFA Nati… Cope… Denmark
17 2025-03-23 Portugal Denmark 5 2 UEFA Nati… Lisb… Portug…
18 2025-06-04 Germany Portugal 1 2 UEFA Nati… Muni… Germany
19 2025-06-08 Portugal Spain 2 2 UEFA Nati… Muni… Germany
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>
and the following games played by Colombia
%>%
scores filter(home_team == "Colombia" | away_team == "Colombia")
# A tibble: 25 × 11
date home_team away_team home_score away_score tournament city country
<date> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 2023-09-07 Colombia Venezuela 1 0 FIFA Worl… Barr… Colomb…
2 2023-09-12 Chile Colombia 0 0 FIFA Worl… Sant… Chile
3 2023-10-12 Colombia Uruguay 2 2 FIFA Worl… Barr… Colomb…
4 2023-10-17 Ecuador Colombia 0 0 FIFA Worl… Quito Ecuador
5 2023-11-16 Colombia Brazil 2 1 FIFA Worl… Barr… Colomb…
6 2023-11-21 Paraguay Colombia 0 1 FIFA Worl… Asun… Paragu…
7 2023-12-16 Mexico Colombia 2 3 Friendly Los … United…
8 2024-06-08 United S… Colombia 1 5 Friendly Land… United…
9 2024-06-15 Colombia Bolivia 3 0 Friendly Denv… United…
10 2024-06-24 Colombia Paraguay 2 1 Copa Amér… Hous… United…
# ℹ 15 more rows
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>
Filtering data
<- scores %>%
home_scores select(id, score = home_score, team = home_team, adversary = away_team, date = date, cup) %>%
mutate(condition = ifelse(scores$neutral, "neutral", "home"))
<- scores %>%
away_scores select(id, score = away_score, team = away_team, adversary = home_team, date = date, cup) %>%
mutate(condition = ifelse(scores$neutral, "neutral", "away"))
<- union_all(home_scores, away_scores) total_scores
We also consider the ranking FIFA of the teams as covariate. For every game, select the up-to-date ranking
<- "https://raw.githubusercontent.com/Dato-Futbol/fifa-ranking/master/ranking_fifa_historical.csv"
path0 <- read_csv(url(path0), show_col_types = FALSE)
rank
<- rank %>%
rank select(team, total_points, date) %>%
group_by(date) %>%
mutate(ranking = rank(-total_points))
Merging the data will be
<- total_scores %>%
scores_rank left_join(rank, by = join_by(adversary == team), relationship = "many-to-many") %>%
filter(date.x >= date.y) %>%
mutate(diff_date = date.x - date.y) %>%
group_by(id, team) %>%
slice_min(diff_date) %>%
select(id, score, team, adversary, condition, date_game = date.x, ranking_adv = ranking, date_rank = date.y, cup)
We are separating the data for both Euro and Copa America. For both cases, we will use as test data the games played after the quarter-finals. i.e, we will finish predicting the results of the semi-finals and the finals.
= "2024-07-09"
date_euro = "2024-07-09"
date_ca
<- scores_rank %>%
train_euro filter(date_game < date_euro, cup == "Euro")
<- scores_rank %>%
train_america filter(date_game < date_ca, cup == "America")
<- scores_rank %>%
test_euro filter(date_game >= date_euro, cup == "Euro")
<- scores_rank %>%
test_america filter(date_game >= date_ca, cup == "America")
Fitting the model
To fit the model I would consider the scores are distributed as Poisson for simplicity, one better option is the Conway-Maxwell-Poisson as shown in (Florez, Guindani, and Vannucci 2024). However, we will use a simple model for computational reasons as we constantly update the model.
<- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_euro, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum")) model_euro
<- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_america, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum")) model_america
Estimation
Euro
<- model_euro$coefficients[2:24]
strength_att <- c(strength_att, 0 - sum(strength_att))
strength_att
<- model_euro$coefficients[25:47]
strength_def <- c(strength_def, 0 - sum(strength_def))
strength_def
<- data.frame(team = euro_2024, strength_att, strength_def, row.names = c()) coeffs
Now, we can plot the estimated strengths in the following way:
ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
geom_point() +
geom_label_repel(aes(label = team),
box.padding = 0.25,
point.padding = 0.5,
segment.color = "grey50") +
xlab(expression("Better Attack" %->% "")) +
ylab(expression("Better Defense" %->% "")) +
ggtitle("Strengths of every team") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
This graph provides an intuition into the strengths and abilities of every team. Teams located in the origin have an overall performance, and below the axis are worse than the averages’ team.
Copa America
Similarly for Copa america
<- length(america_2024)
n <- model_america$coefficients[2:n]
strength_att <- c(strength_att, 0 - sum(strength_att))
strength_att
<- model_america$coefficients[(n+1):(2*n-1)]
strength_def <- c(strength_def, 0 - sum(strength_def))
strength_def
<- data.frame(team = america_2024, strength_att, strength_def, row.names = c()) coeffs
Now, we can plot the estimated strengths in the following way:
ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
geom_point() +
geom_label_repel(aes(label = team),
box.padding = 0.25,
point.padding = 0.5,
segment.color = "grey50") +
xlab(expression("Better Attack" %->% "")) +
ylab(expression("Better Defense" %->% "")) +
ggtitle("Strengths of every team") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
Predictions
Euro
Now, we predict the scores of the future games, i.e, the semifinals and the final.
<- test_euro %>%
test_pred select(id, team, adversary, condition, ranking_adv)
<- posterior_predict(model_euro, newdata = test_pred)
predictions
<- apply(predictions, 2, mean)
mean_goals
<- test_pred %>%
test_pred ungroup() %>%
mutate(goals = mean_goals) %>%
group_by(id) %>%
summarise("Team 1" = first(team),
"Exp. Goals" = first(goals),
"Team 2" = last(team),
"Exp. Goals 2" = last(goals))
The predictions of future games are:
%>%
test_pred print(n = 1e3)
# A tibble: 77 × 5
id `Team 1` `Exp. Goals` `Team 2` `Exp. Goals 2`
<int> <chr> <dbl> <chr> <dbl>
1 103 France 0.393 Spain 0.858
2 104 England 1.90 Netherlands 1.60
3 105 England 0.632 Spain 1.55
4 106 Croatia 1.55 Portugal 1.32
5 107 Poland 2.21 Scotland 0.928
6 108 Denmark 0.727 Switzerland 0.935
7 109 Serbia 0.356 Spain 1.88
8 110 France 1.29 Italy 0.662
9 111 Austria 1.85 Slovenia 1.33
10 112 Germany 2.19 Hungary 0.879
11 113 Czech Republic 1.79 Georgia 1.87
12 114 Albania 1.38 Ukraine 0.784
13 115 Portugal 2.33 Scotland 0.765
14 116 Croatia 1.94 Poland 1.25
15 117 Denmark 1.06 Serbia 0.499
16 118 Spain 1.25 Switzerland 0.645
17 119 Belgium 0.666 France 0.851
18 120 Germany 2.02 Netherlands 1.99
19 121 Albania 4.41 Georgia 1.19
20 122 Czech Republic 1.07 Ukraine 1.38
21 123 Belgium 1.12 Italy 0.869
22 124 Hungary 1.71 Netherlands 2.47
23 125 Albania 1.92 Czech Republic 0.851
24 126 Georgia 1.10 Ukraine 1.95
25 127 Croatia 3.61 Scotland 0.759
26 128 Poland 1.50 Portugal 1.16
27 129 Serbia 0.512 Switzerland 1.22
28 130 Denmark 0.428 Spain 1.76
29 131 Belgium 0.797 France 0.710
30 132 Germany 2.41 Netherlands 1.67
31 133 Albania 3.75 Georgia 1.42
32 134 Czech Republic 0.678 Ukraine 1.07
33 135 Portugal 2.14 Scotland 0.918
34 136 Croatia 1.64 Poland 1.47
35 137 Serbia 0.292 Spain 2.30
36 138 Denmark 0.619 Switzerland 1.09
37 139 Belgium 1.33 Italy 0.717
38 140 Poland 1.23 Portugal 1.38
39 141 Croatia 3.06 Scotland 0.906
40 142 Denmark 0.507 Spain 1.48
41 143 Serbia 0.432 Switzerland 1.44
42 144 Hungary 1.43 Netherlands 2.93
43 145 Albania 2.25 Czech Republic 0.706
44 146 Georgia 1.69 Ukraine 2.58
45 147 France 1.10 Italy 0.804
46 148 Austria 2.22 Slovenia 1.10
47 149 Croatia 1.84 Portugal 1.12
48 150 Poland 2.86 Scotland 0.768
49 151 Denmark 0.928 Serbia 0.580
50 152 Spain 1.50 Switzerland 0.536
51 153 Germany 1.85 Hungary 1.05
52 154 Albania 2.16 Ukraine 1.05
53 155 Czech Republic 2.19 Georgia 1.53
54 156 Croatia 1.06 France 1.21
55 157 Denmark 1.18 Portugal 0.831
56 158 Germany 1.46 Italy 1.16
57 159 Netherlands 1.06 Spain 2.59
58 160 Austria 2.70 Serbia 0.652
59 161 Hungary 1.26 Turkey 1.78
60 162 Belgium 1.05 Ukraine 0.515
61 163 Slovakia 0.898 Slovenia 1.36
62 164 Albania 1.65 England 2.12
63 165 Croatia 0.884 France 1.40
64 166 Germany 1.70 Italy 0.978
65 167 Denmark 0.978 Portugal 0.985
66 168 Netherlands 0.877 Spain 3.10
67 169 Belgium 1.61 Ukraine 0.662
68 170 Hungary 1.48 Turkey 1.51
69 171 Austria 2.29 Serbia 0.772
70 172 Slovakia 0.759 Slovenia 1.62
71 173 Germany 1.86 Portugal 0.914
72 174 France 0.399 Spain 0.866
73 175 Austria 3.40 Romania 1.52
74 176 Albania 1.88 Serbia 0.62
75 177 France 0.942 Germany 1.01
76 178 Portugal 0.376 Spain 1.52
77 179 Croatia 2.37 Czech Republic 0.804
Particullarly, the distribution of the prediction of the score for the final looks as shown below.
<- tibble("England" = predictions[,5], "Spain" = predictions[,6])
results_game
<- results_game %>%
df_count group_by(England, Spain) %>%
summarise(count = n(), .groups = "keep") %>%
ungroup()
<- nrow(results_game)
total_combinations
<- df_count %>%
df_count mutate(proportion = count/total_combinations)
ggplot(df_count, aes(x = England, y = Spain, fill = proportion)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Heatmap of Goals Scored by the two teams",
x = "England",
y = "Spain",
fill = "Probability") +
theme_minimal()
According to this, it is likely that the final will be Spain vs. England. Spain is expected to be the champion, with a predicted score of 2-0.
Copa America
Similarly, we can generate the predictions for the last round of the games in Copa América
<- test_america %>%
test_pred select(id, team, adversary, condition, ranking_adv)
<- posterior_predict(model_america, newdata = test_pred)
predictions
<- apply(predictions, 2, mean)
mean_goals
<- test_pred %>%
test_pred ungroup() %>%
mutate(goals = mean_goals) %>%
group_by(id) %>%
summarise("Team 1" = first(team),
"Exp. Goals" = first(goals),
"Team 2" = last(team),
"Exp. Goals 2" = last(goals))
%>%
test_pred print(n = 1e3)
# A tibble: 74 × 5
id `Team 1` `Exp. Goals` `Team 2` `Exp. Goals 2`
<int> <chr> <dbl> <chr> <dbl>
1 96 Argentina 1.93 Canada 0.433
2 97 Colombia 0.872 Uruguay 1.33
3 98 Canada 0.498 Uruguay 2.59
4 99 Argentina 0.989 Colombia 0.766
5 100 Bolivia 0.544 Venezuela 1.62
6 101 Argentina 1.08 Chile 0.181
7 102 Colombia 1.27 Peru 0.108
8 103 Paraguay 0.173 Uruguay 2.32
9 104 Brazil 1.60 Ecuador 0.728
10 105 United States 2.10 United States 2.10
11 106 Canada 1.04 Mexico 1.80
12 107 Argentina 0.888 Colombia 0.952
13 108 Ecuador 1.25 Peru 0.0772
14 109 Bolivia 0.256 Chile 1.14
15 110 Uruguay 1.30 Venezuela 0.761
16 111 Brazil 1.18 Paraguay 0.416
17 112 Bolivia 0.499 Colombia 2.92
18 113 Ecuador 1.56 Paraguay 0.285
19 114 Argentina 0.899 Venezuela 0.609
20 115 Brazil 0.654 Chile 0.537
21 116 Peru 0.0724 Uruguay 1.32
22 117 United States 2.68 United States 2.68
23 118 Canada 2.76 Panama 1.06
24 119 United States 0.819 United States 0.819
25 120 Chile 0.324 Colombia 1.20
26 121 Paraguay 0.399 Venezuela 1.08
27 122 Ecuador 0.46 Uruguay 1.98
28 123 Argentina 3.58 Bolivia 0.170
29 124 Brazil 1.60 Peru 0.0665
30 125 Costa Rica 1.59 Panama 0.996
31 126 United States 1.33 United States 1.33
32 127 Brazil 0.938 Venezuela 1.22
33 128 Bolivia 0.359 Ecuador 2.72
34 129 Argentina 1.29 Paraguay 0.208
35 130 Chile 0.346 Peru 0.0751
36 131 Colombia 0.747 Uruguay 1.78
37 132 Costa Rica 1.02 Panama 1.53
38 133 United States 2.04 United States 2.04
39 134 Bolivia 0.706 Paraguay 0.659
40 135 Colombia 1.92 Ecuador 0.676
41 136 Argentina 1.50 Peru 0.0372
42 137 Chile 0.509 Venezuela 0.502
43 138 Brazil 0.951 Uruguay 1.21
44 139 United States 0.990 United States 0.990
45 140 United States 2.10 United States 2.10
46 141 Chile 1.42 Panama 0.329
47 142 Canada 1.05 Mexico 1.93
48 143 United States 2.68 United States 2.68
49 144 Chile 0.465 Paraguay 0.267
50 145 Bolivia 0.355 Peru 0.275
51 146 Brazil 1.42 Colombia 1.16
52 147 Ecuador 1.06 Venezuela 0.778
53 148 Argentina 0.588 Uruguay 0.954
54 149 United States 2.21 United States 2.21
55 150 Mexico 2.31 Panama 0.739
56 151 Bolivia 0.331 Uruguay 2.99
57 152 Colombia 2.49 Paraguay 0.255
58 153 Peru 0.0665 Venezuela 1.27
59 154 Argentina 1.44 Brazil 0.502
60 155 Chile 0.563 Ecuador 0.492
61 156 Brazil 1.02 Ecuador 1.13
62 157 Paraguay 0.260 Uruguay 1.71
63 158 Argentina 0.611 Chile 0.265
64 159 Colombia 1.97 Peru 0.068
65 160 Bolivia 0.315 Venezuela 3.01
66 161 Bolivia 0.335 Chile 0.850
67 162 Uruguay 1.85 Venezuela 0.486
68 163 Ecuador 0.774 Peru 0.116
69 164 Argentina 1.36 Colombia 0.615
70 165 Brazil 2.09 Paraguay 0.260
71 166 Costa Rica 0.586 Mexico 1.78
72 167 Jamaica 1.63 Panama 1.25
73 168 United States 2.11 United States 2.11
74 169 United States 1.25 United States 1.25
In this case, Argentina is expected to easily defeat Canada, and Uruguay will likely beat Colombia. However, Colombia won its game 1-0, so the final will be Argentina vs. Colombia, while the 3rd and 4th place game will be played between Canada and Uruguay.
The final is anticipated to be a close match, with the most likely result being 0-0, as indicated below. However, Argentina is expected to have a slightly higher probability of scoring more goals than Colombia.
<- tibble("Argentina" = predictions[,7], "Colombia" = predictions[,8])
results_game
<- results_game %>%
df_count group_by(Argentina, Colombia) %>%
summarise(count = n(), .groups = "keep") %>%
ungroup()
<- nrow(results_game)
total_combinations
<- df_count %>%
df_count mutate(proportion = count/total_combinations)
ggplot(df_count, aes(x = Argentina, y = Colombia, fill = proportion)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Heatmap of Goals Scored by the two teams",
x = "Argentina",
y = "Colombia",
fill = "Probability") +
theme_minimal()
As we saw, the model accurately predicted the scores and captured the strengths of each team. It also produced reasonable and accurate results for the final games based on the quarter-final outcomes. Spain defeated England 2-1, while the match between Argentina and Colombia ended 0-0, with Argentina scoring a goal in extra time.
For future models, incorporating dynamic strengths could improve accuracy.