Mauro Florez
  • Home
  • Projects
  • Teaching
  • Gallery
  • Resume

On this page

  • Packages
  • Data
    • Filtering data
  • Fitting the model
  • Estimation
    • Euro
    • Copa America
  • Predictions
    • Euro
    • Copa America

Predicting Euro & Copa America - 2024

A statistical model to analyze the results of the Euro Cup and the Copa America 2024.
Published

June 30, 2024

In this project, I will propose a model for predicting the Euro & Copa America - 2024 scores based on the model presented in (Florez, Guindani, and Vannucci 2024).

Packages

I will use a Bayesian model, so we need the following libraries

require(rstanarm)
require(rstantools)
require(bayesplot)

As well as

require(readr)
require(dplyr)
require(lubridate)
require(ggplot2)
require(ggrepel)
require(latex2exp)

Data

The data is contained in this repository https://github.com/martj42/international_results.

urlfile = "https://raw.githubusercontent.com/martj42/international_results/master/results.csv"

We load the data

data <- read_csv(url(urlfile), show_col_types = FALSE)

The proposed model is based on the teams’ strengths attacking and defending. To train the model, we used the scores of games played since season 2023-2024.

euro_2024 = c('Albania','Austria','Belgium','Croatia','Czech Republic','Denmark','England','France','Georgia','Germany','Hungary','Italy','Netherlands','Poland','Portugal','Romania','Scotland','Serbia','Slovakia','Slovenia','Spain','Switzerland','Turkey','Ukraine')

america_2024 = sort(c('Argentina', 'Chile', 'Peru', 'Canada', 'Mexico', 'Ecuador', 'Venezuela', 'Jamaica', 'United States', 'Uruguay', 'Panama', 'Bolivia', 'Brazil', 'Colombia', 'Paraguay', 'Costa Rica'))  

scores <- data %>% 
  filter(home_team %in% euro_2024 & away_team %in% euro_2024) %>% 
  filter(date > "2023-06-01") %>% 
  mutate(id = row_number(), cup = "Euro") %>% 
  bind_rows(data %>% 
  filter(home_team %in% america_2024 & away_team %in% america_2024) %>% 
  filter(date > "2023-06-01") %>% 
  mutate(id = row_number(), cup = "America")
  )

For example, we consider the following games played by Portugal.

scores %>% 
  filter(home_team == "Portugal" | away_team == "Portugal")
# A tibble: 19 × 11
   date       home_team away_team home_score away_score tournament city  country
   <date>     <chr>     <chr>          <dbl>      <dbl> <chr>      <chr> <chr>  
 1 2023-09-08 Slovakia  Portugal           0          1 UEFA Euro… Brat… Slovak…
 2 2023-10-13 Portugal  Slovakia           3          2 UEFA Euro… Porto Portug…
 3 2024-03-26 Slovenia  Portugal           2          0 Friendly   Ljub… Sloven…
 4 2024-06-08 Portugal  Croatia            1          2 Friendly   Oeir… Portug…
 5 2024-06-18 Portugal  Czech Re…          2          1 UEFA Euro  Leip… Germany
 6 2024-06-22 Turkey    Portugal           0          3 UEFA Euro  Dort… Germany
 7 2024-06-26 Georgia   Portugal           2          0 UEFA Euro  Gels… Germany
 8 2024-07-01 Portugal  Slovenia           0          0 UEFA Euro  Fran… Germany
 9 2024-07-05 Portugal  France             0          0 UEFA Euro  Hamb… Germany
10 2024-09-05 Portugal  Croatia            2          1 UEFA Nati… Lisb… Portug…
11 2024-09-08 Portugal  Scotland           2          1 UEFA Nati… Lisb… Portug…
12 2024-10-12 Poland    Portugal           1          3 UEFA Nati… Wars… Poland 
13 2024-10-15 Scotland  Portugal           0          0 UEFA Nati… Glas… Scotla…
14 2024-11-15 Portugal  Poland             5          1 UEFA Nati… Porto Portug…
15 2024-11-18 Croatia   Portugal           1          1 UEFA Nati… Split Croatia
16 2025-03-20 Denmark   Portugal           1          0 UEFA Nati… Cope… Denmark
17 2025-03-23 Portugal  Denmark            5          2 UEFA Nati… Lisb… Portug…
18 2025-06-04 Germany   Portugal           1          2 UEFA Nati… Muni… Germany
19 2025-06-08 Portugal  Spain              2          2 UEFA Nati… Muni… Germany
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>

and the following games played by Colombia

scores %>% 
  filter(home_team == "Colombia" | away_team == "Colombia")
# A tibble: 25 × 11
   date       home_team away_team home_score away_score tournament city  country
   <date>     <chr>     <chr>          <dbl>      <dbl> <chr>      <chr> <chr>  
 1 2023-09-07 Colombia  Venezuela          1          0 FIFA Worl… Barr… Colomb…
 2 2023-09-12 Chile     Colombia           0          0 FIFA Worl… Sant… Chile  
 3 2023-10-12 Colombia  Uruguay            2          2 FIFA Worl… Barr… Colomb…
 4 2023-10-17 Ecuador   Colombia           0          0 FIFA Worl… Quito Ecuador
 5 2023-11-16 Colombia  Brazil             2          1 FIFA Worl… Barr… Colomb…
 6 2023-11-21 Paraguay  Colombia           0          1 FIFA Worl… Asun… Paragu…
 7 2023-12-16 Mexico    Colombia           2          3 Friendly   Los … United…
 8 2024-06-08 United S… Colombia           1          5 Friendly   Land… United…
 9 2024-06-15 Colombia  Bolivia            3          0 Friendly   Denv… United…
10 2024-06-24 Colombia  Paraguay           2          1 Copa Amér… Hous… United…
# ℹ 15 more rows
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>

Filtering data

home_scores <- scores %>% 
  select(id, score = home_score, team = home_team, adversary = away_team, date = date, cup) %>% 
  mutate(condition = ifelse(scores$neutral, "neutral", "home"))
  
away_scores <- scores %>% 
  select(id, score = away_score, team = away_team, adversary = home_team, date = date, cup) %>% 
  mutate(condition = ifelse(scores$neutral, "neutral", "away"))

total_scores <- union_all(home_scores, away_scores)

We also consider the ranking FIFA of the teams as covariate. For every game, select the up-to-date ranking

path0 <- "https://raw.githubusercontent.com/Dato-Futbol/fifa-ranking/master/ranking_fifa_historical.csv"
rank <- read_csv(url(path0), show_col_types = FALSE)

rank <- rank %>% 
  select(team, total_points, date) %>% 
  group_by(date) %>% 
  mutate(ranking = rank(-total_points))

Merging the data will be

scores_rank <- total_scores %>% 
  left_join(rank, by = join_by(adversary == team), relationship = "many-to-many") %>% 
  filter(date.x >= date.y) %>% 
  mutate(diff_date = date.x - date.y) %>% 
  group_by(id, team) %>% 
  slice_min(diff_date) %>% 
  select(id, score, team, adversary, condition, date_game = date.x, ranking_adv = ranking, date_rank = date.y, cup)

We are separating the data for both Euro and Copa America. For both cases, we will use as test data the games played after the quarter-finals. i.e, we will finish predicting the results of the semi-finals and the finals.

date_euro = "2024-07-09"
date_ca = "2024-07-09"

train_euro <- scores_rank %>% 
  filter(date_game < date_euro, cup == "Euro")

train_america <- scores_rank %>% 
  filter(date_game < date_ca, cup == "America")

test_euro <- scores_rank %>% 
  filter(date_game >= date_euro, cup == "Euro")

test_america <- scores_rank %>% 
  filter(date_game >= date_ca, cup == "America")

Fitting the model

To fit the model I would consider the scores are distributed as Poisson for simplicity, one better option is the Conway-Maxwell-Poisson as shown in (Florez, Guindani, and Vannucci 2024). However, we will use a simple model for computational reasons as we constantly update the model.

model_euro <- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_euro, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum"))
model_america <- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_america, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum"))

Estimation

Euro

strength_att <- model_euro$coefficients[2:24]
strength_att <- c(strength_att, 0 - sum(strength_att))

strength_def <- model_euro$coefficients[25:47]
strength_def <- c(strength_def, 0 - sum(strength_def))

coeffs <- data.frame(team = euro_2024, strength_att, strength_def, row.names = c())

Now, we can plot the estimated strengths in the following way:

ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_point() +
  geom_label_repel(aes(label = team),
                            box.padding = 0.25,
                            point.padding = 0.5,
                            segment.color = "grey50") +
  xlab(expression("Better Attack" %->% "")) +
  ylab(expression("Better Defense" %->% "")) + 
  ggtitle("Strengths of every team") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

This graph provides an intuition into the strengths and abilities of every team. Teams located in the origin have an overall performance, and below the axis are worse than the averages’ team.

Copa America

Similarly for Copa america

n <- length(america_2024)
strength_att <- model_america$coefficients[2:n]
strength_att <- c(strength_att, 0 - sum(strength_att))

strength_def <- model_america$coefficients[(n+1):(2*n-1)]
strength_def <- c(strength_def, 0 - sum(strength_def))

coeffs <- data.frame(team = america_2024, strength_att, strength_def, row.names = c())

Now, we can plot the estimated strengths in the following way:

ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_point() +
  geom_label_repel(aes(label = team),
                            box.padding = 0.25,
                            point.padding = 0.5,
                            segment.color = "grey50") +
  xlab(expression("Better Attack" %->% "")) +
  ylab(expression("Better Defense" %->% "")) + 
  ggtitle("Strengths of every team") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

Predictions

Euro

Now, we predict the scores of the future games, i.e, the semifinals and the final.

test_pred <- test_euro %>% 
  select(id, team, adversary, condition, ranking_adv)

predictions <- posterior_predict(model_euro, newdata = test_pred) 

mean_goals <- apply(predictions, 2, mean)

test_pred <- test_pred %>% 
  ungroup() %>% 
  mutate(goals = mean_goals) %>% 
  group_by(id) %>% 
  summarise("Team 1" = first(team),
            "Exp. Goals" = first(goals),
            "Team 2" = last(team),
            "Exp. Goals 2" = last(goals)) 

The predictions of future games are:

test_pred %>% 
  print(n = 1e3)
# A tibble: 77 × 5
      id `Team 1`       `Exp. Goals` `Team 2`       `Exp. Goals 2`
   <int> <chr>                 <dbl> <chr>                   <dbl>
 1   103 France                0.402 Spain                   0.842
 2   104 England               1.89  Netherlands             1.60 
 3   105 England               0.626 Spain                   1.53 
 4   106 Croatia               1.53  Portugal                1.30 
 5   107 Poland                2.16  Scotland                0.940
 6   108 Denmark               0.748 Switzerland             0.941
 7   109 Serbia                0.359 Spain                   1.88 
 8   110 France                1.29  Italy                   0.644
 9   111 Austria               1.86  Slovenia                1.32 
10   112 Germany               2.17  Hungary                 0.889
11   113 Czech Republic        1.81  Georgia                 1.88 
12   114 Albania               1.38  Ukraine                 0.789
13   115 Portugal              2.26  Scotland                0.754
14   116 Croatia               1.95  Poland                  1.24 
15   117 Denmark               1.08  Serbia                  0.487
16   118 Spain                 1.30  Switzerland             0.669
17   119 Belgium               0.669 France                  0.835
18   120 Germany               2.02  Netherlands             2.01 
19   121 Albania               4.40  Georgia                 1.18 
20   122 Czech Republic        1.09  Ukraine                 1.38 
21   123 Belgium               1.12  Italy                   0.858
22   124 Hungary               1.74  Netherlands             2.43 
23   125 Albania               1.92  Czech Republic          0.864
24   126 Georgia               1.09  Ukraine                 1.97 
25   127 Croatia               3.50  Scotland                0.75 
26   128 Poland                1.46  Portugal                1.12 
27   129 Serbia                0.532 Switzerland             1.20 
28   130 Denmark               0.433 Spain                   1.75 
29   131 Belgium               0.808 France                  0.715
30   132 Germany               2.40  Netherlands             1.67 
31   133 Albania               3.84  Georgia                 1.40 
32   134 Czech Republic        0.709 Ukraine                 1.07 
33   135 Portugal              2.10  Scotland                0.891
34   136 Croatia               1.61  Poland                  1.46 
35   137 Serbia                0.307 Spain                   2.24 
36   138 Denmark               0.619 Switzerland             1.08 
37   139 Belgium               1.34  Italy                   0.712
38   140 Poland                1.24  Portugal                1.36 
39   141 Croatia               2.95  Scotland                0.921
40   142 Denmark               0.523 Spain                   1.47 
41   143 Serbia                0.446 Switzerland             1.42 
42   144 Hungary               1.42  Netherlands             2.90 
43   145 Albania               2.27  Czech Republic          0.713
44   146 Georgia               1.70  Ukraine                 2.59 
45   147 France                1.08  Italy                   0.793
46   148 Austria               2.21  Slovenia                1.12 
47   149 Croatia               1.84  Portugal                1.11 
48   150 Poland                2.70  Scotland                0.791
49   151 Denmark               0.933 Serbia                  0.590
50   152 Spain                 1.53  Switzerland             0.555
51   153 Germany               1.84  Hungary                 1.07 
52   154 Albania               2.21  Ukraine                 1.04 
53   155 Czech Republic        2.26  Georgia                 1.54 
54   156 Croatia               1.06  France                  1.18 
55   157 Denmark               1.17  Portugal                0.845
56   158 Germany               1.46  Italy                   1.18 
57   159 Netherlands           1.07  Spain                   2.56 
58   160 Austria               2.66  Serbia                  0.661
59   161 Hungary               1.24  Turkey                  1.80 
60   162 Belgium               1.04  Ukraine                 0.511
61   163 Slovakia              0.914 Slovenia                1.38 
62   164 Albania               1.63  England                 2.12 
63   165 Croatia               0.888 France                  1.41 
64   166 Germany               1.71  Italy                   0.989
65   167 Denmark               1.00  Portugal                0.986
66   168 Netherlands           0.903 Spain                   3.08 
67   169 Belgium               1.62  Ukraine                 0.641
68   170 Hungary               1.49  Turkey                  1.48 
69   171 Austria               2.26  Serbia                  0.787
70   172 Slovakia              0.753 Slovenia                1.65 
71   173 Germany               1.85  Portugal                0.906
72   174 France                0.401 Spain                   0.838
73   175 Austria               3.41  Romania                 1.53 
74   176 Albania               1.86  Serbia                  0.633
75   177 France                0.934 Germany                 1.04 
76   178 Portugal              0.360 Spain                   1.51 
77   179 Croatia               2.38  Czech Republic          0.825

Particullarly, the distribution of the prediction of the score for the final looks as shown below.

results_game <- tibble("England" = predictions[,5], "Spain" = predictions[,6])

df_count <- results_game %>% 
  group_by(England, Spain) %>% 
  summarise(count = n(), .groups = "keep") %>% 
  ungroup()

total_combinations <- nrow(results_game)

df_count <- df_count %>% 
    mutate(proportion = count/total_combinations)
  
ggplot(df_count, aes(x = England, y = Spain, fill = proportion)) +
  geom_tile() + 
  scale_fill_gradient(low = "white", high = "blue") + 
  labs(title = "Heatmap of Goals Scored by the two teams",
       x = "England", 
       y = "Spain",
       fill = "Probability") + 
  theme_minimal()

According to this, it is likely that the final will be Spain vs. England. Spain is expected to be the champion, with a predicted score of 2-0.

Copa America

Similarly, we can generate the predictions for the last round of the games in Copa América

test_pred <- test_america %>% 
  select(id, team, adversary, condition, ranking_adv)

predictions <- posterior_predict(model_america, newdata = test_pred) 

mean_goals <- apply(predictions, 2, mean)

test_pred <- test_pred %>% 
  ungroup() %>% 
  mutate(goals = mean_goals) %>% 
  group_by(id) %>% 
  summarise("Team 1" = first(team),
            "Exp. Goals" = first(goals),
            "Team 2" = last(team),
            "Exp. Goals 2" = last(goals)) 
test_pred %>% 
  print(n = 1e3)
# A tibble: 70 × 5
      id `Team 1`      `Exp. Goals` `Team 2`      `Exp. Goals 2`
   <int> <chr>                <dbl> <chr>                  <dbl>
 1    96 Argentina           1.92   Canada                0.416 
 2    97 Colombia            0.874  Uruguay               1.34  
 3    98 Canada              0.498  Uruguay               2.57  
 4    99 Argentina           1.03   Colombia              0.766 
 5   100 Bolivia             0.554  Venezuela             1.62  
 6   101 Argentina           1.09   Chile                 0.171 
 7   102 Colombia            1.25   Peru                  0.109 
 8   103 Paraguay            0.171  Uruguay               2.29  
 9   104 Brazil              1.58   Ecuador               0.728 
10   105 United States       2.06   United States         2.06  
11   106 Canada              1.01   Mexico                1.80  
12   107 Argentina           0.888  Colombia              0.932 
13   108 Ecuador             1.26   Peru                  0.0765
14   109 Bolivia             0.258  Chile                 1.11  
15   110 Uruguay             1.33   Venezuela             0.754 
16   111 Brazil              1.17   Paraguay              0.405 
17   112 Bolivia             0.481  Colombia              2.89  
18   113 Ecuador             1.57   Paraguay              0.281 
19   114 Argentina           0.918  Venezuela             0.608 
20   115 Brazil              0.649  Chile                 0.543 
21   116 Peru                0.0716 Uruguay               1.30  
22   117 United States       2.72   United States         2.72  
23   118 Canada              2.79   Panama                1.06  
24   119 United States       0.826  United States         0.826 
25   120 Chile               0.326  Colombia              1.19  
26   121 Paraguay            0.396  Venezuela             1.08  
27   122 Ecuador             0.450  Uruguay               1.98  
28   123 Argentina           3.58   Bolivia               0.164 
29   124 Brazil              1.61   Peru                  0.0758
30   125 Costa Rica          1.60   Panama                1.00  
31   126 United States       1.34   United States         1.34  
32   127 Brazil              0.929  Venezuela             1.22  
33   128 Bolivia             0.339  Ecuador               2.74  
34   129 Argentina           1.28   Paraguay              0.210 
35   130 Chile               0.364  Peru                  0.0775
36   131 Colombia            0.746  Uruguay               1.78  
37   132 Costa Rica          1.02   Panama                1.54  
38   133 United States       2.01   United States         2.01  
39   134 Bolivia             0.739  Paraguay              0.657 
40   135 Colombia            1.89   Ecuador               0.671 
41   136 Argentina           1.55   Peru                  0.0361
42   137 Chile               0.513  Venezuela             0.507 
43   138 Brazil              0.939  Uruguay               1.22  
44   139 United States       0.996  United States         0.996 
45   140 United States       2.12   United States         2.12  
46   141 Chile               1.44   Panama                0.325 
47   142 Canada              1.05   Mexico                1.94  
48   143 United States       2.71   United States         2.71  
49   144 Chile               0.472  Paraguay              0.262 
50   145 Bolivia             0.365  Peru                  0.273 
51   146 Brazil              1.41   Colombia              1.16  
52   147 Ecuador             1.04   Venezuela             0.782 
53   148 Argentina           0.592  Uruguay               0.961 
54   149 United States       2.22   United States         2.22  
55   150 Mexico              2.34   Panama                0.774 
56   151 Bolivia             0.339  Uruguay               3.01  
57   152 Colombia            2.51   Paraguay              0.247 
58   153 Peru                0.0707 Venezuela             1.29  
59   154 Argentina           1.44   Brazil                0.497 
60   155 Chile               0.561  Ecuador               0.488 
61   156 Brazil              1.00   Ecuador               1.15  
62   157 Paraguay            0.262  Uruguay               1.74  
63   158 Argentina           0.619  Chile                 0.274 
64   159 Colombia            1.97   Peru                  0.0684
65   160 Bolivia             0.336  Venezuela             2.99  
66   161 Bolivia             0.340  Chile                 0.848 
67   162 Uruguay             1.82   Venezuela             0.474 
68   163 Ecuador             0.761  Peru                  0.121 
69   164 Argentina           1.37   Colombia              0.584 
70   165 Brazil              2.10   Paraguay              0.256 

In this case, Argentina is expected to easily defeat Canada, and Uruguay will likely beat Colombia. However, Colombia won its game 1-0, so the final will be Argentina vs. Colombia, while the 3rd and 4th place game will be played between Canada and Uruguay.

The final is anticipated to be a close match, with the most likely result being 0-0, as indicated below. However, Argentina is expected to have a slightly higher probability of scoring more goals than Colombia.

results_game <- tibble("Argentina" = predictions[,7], "Colombia" = predictions[,8])

df_count <- results_game %>% 
  group_by(Argentina, Colombia) %>% 
  summarise(count = n(), .groups = "keep") %>% 
  ungroup()

total_combinations <- nrow(results_game)

df_count <- df_count %>% 
    mutate(proportion = count/total_combinations)
  
ggplot(df_count, aes(x = Argentina, y = Colombia, fill = proportion)) +
  geom_tile() + 
  scale_fill_gradient(low = "white", high = "blue") + 
  labs(title = "Heatmap of Goals Scored by the two teams",
       x = "Argentina", 
       y = "Colombia",
       fill = "Probability") + 
  theme_minimal()

As we saw, the model accurately predicted the scores and captured the strengths of each team. It also produced reasonable and accurate results for the final games based on the quarter-final outcomes. Spain defeated England 2-1, while the match between Argentina and Colombia ended 0-0, with Argentina scoring a goal in extra time.

For future models, incorporating dynamic strengths could improve accuracy.

References

Florez, Mauro, Michele Guindani, and Marina Vannucci. 2024. “Bayesian Bivariate ConwayMaxwellPoisson Regression Model for Correlated Count Data in Sports.” Journal of Quantitative Analysis in Sports 0 (0). https://doi.org/10.1515/jqas-2024-0072.