Mauro Florez
  • Home
  • Projects
  • Teaching
  • Gallery
  • Resume

On this page

  • Packages
  • Data
    • Filtering data
  • Fitting the model
  • Estimation
    • Euro
    • Copa America
  • Predictions
    • Euro
    • Copa America

Predicting Euro & Copa America - 2024

A statistical model to analyze the results of the Euro Cup and the Copa America 2024.
Published

June 30, 2024

In this project, I will propose a model for predicting the Euro & Copa America - 2024 scores based on the model presented in (Florez, Guindani, and Vannucci 2024).

Packages

I will use a Bayesian model, so we need the following libraries

require(rstanarm)
require(rstantools)
require(bayesplot)

As well as

require(readr)
require(dplyr)
require(lubridate)
require(ggplot2)
require(ggrepel)
require(latex2exp)

Data

The data is contained in this repository https://github.com/martj42/international_results.

urlfile = "https://raw.githubusercontent.com/martj42/international_results/master/results.csv"

We load the data

data <- read_csv(url(urlfile), show_col_types = FALSE)

The proposed model is based on the teams’ strengths attacking and defending. To train the model, we used the scores of games played since season 2023-2024.

euro_2024 = c('Albania','Austria','Belgium','Croatia','Czech Republic','Denmark','England','France','Georgia','Germany','Hungary','Italy','Netherlands','Poland','Portugal','Romania','Scotland','Serbia','Slovakia','Slovenia','Spain','Switzerland','Turkey','Ukraine')

america_2024 = sort(c('Argentina', 'Chile', 'Peru', 'Canada', 'Mexico', 'Ecuador', 'Venezuela', 'Jamaica', 'United States', 'Uruguay', 'Panama', 'Bolivia', 'Brazil', 'Colombia', 'Paraguay', 'Costa Rica'))  

scores <- data %>% 
  filter(home_team %in% euro_2024 & away_team %in% euro_2024) %>% 
  filter(date > "2023-06-01") %>% 
  mutate(id = row_number(), cup = "Euro") %>% 
  bind_rows(data %>% 
  filter(home_team %in% america_2024 & away_team %in% america_2024) %>% 
  filter(date > "2023-06-01") %>% 
  mutate(id = row_number(), cup = "America")
  )

For example, we consider the following games played by Portugal.

scores %>% 
  filter(home_team == "Portugal" | away_team == "Portugal")
# A tibble: 19 × 11
   date       home_team away_team home_score away_score tournament city  country
   <date>     <chr>     <chr>          <dbl>      <dbl> <chr>      <chr> <chr>  
 1 2023-09-08 Slovakia  Portugal           0          1 UEFA Euro… Brat… Slovak…
 2 2023-10-13 Portugal  Slovakia           3          2 UEFA Euro… Porto Portug…
 3 2024-03-26 Slovenia  Portugal           2          0 Friendly   Ljub… Sloven…
 4 2024-06-08 Portugal  Croatia            1          2 Friendly   Oeir… Portug…
 5 2024-06-18 Portugal  Czech Re…          2          1 UEFA Euro  Leip… Germany
 6 2024-06-22 Turkey    Portugal           0          3 UEFA Euro  Dort… Germany
 7 2024-06-26 Georgia   Portugal           2          0 UEFA Euro  Gels… Germany
 8 2024-07-01 Portugal  Slovenia           0          0 UEFA Euro  Fran… Germany
 9 2024-07-05 Portugal  France             0          0 UEFA Euro  Hamb… Germany
10 2024-09-05 Portugal  Croatia            2          1 UEFA Nati… Lisb… Portug…
11 2024-09-08 Portugal  Scotland           2          1 UEFA Nati… Lisb… Portug…
12 2024-10-12 Poland    Portugal           1          3 UEFA Nati… Wars… Poland 
13 2024-10-15 Scotland  Portugal           0          0 UEFA Nati… Glas… Scotla…
14 2024-11-15 Portugal  Poland             5          1 UEFA Nati… Porto Portug…
15 2024-11-18 Croatia   Portugal           1          1 UEFA Nati… Split Croatia
16 2025-03-20 Denmark   Portugal           1          0 UEFA Nati… Cope… Denmark
17 2025-03-23 Portugal  Denmark            5          2 UEFA Nati… Lisb… Portug…
18 2025-06-04 Germany   Portugal           1          2 UEFA Nati… Muni… Germany
19 2025-06-08 Portugal  Spain              2          2 UEFA Nati… Muni… Germany
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>

and the following games played by Colombia

scores %>% 
  filter(home_team == "Colombia" | away_team == "Colombia")
# A tibble: 25 × 11
   date       home_team away_team home_score away_score tournament city  country
   <date>     <chr>     <chr>          <dbl>      <dbl> <chr>      <chr> <chr>  
 1 2023-09-07 Colombia  Venezuela          1          0 FIFA Worl… Barr… Colomb…
 2 2023-09-12 Chile     Colombia           0          0 FIFA Worl… Sant… Chile  
 3 2023-10-12 Colombia  Uruguay            2          2 FIFA Worl… Barr… Colomb…
 4 2023-10-17 Ecuador   Colombia           0          0 FIFA Worl… Quito Ecuador
 5 2023-11-16 Colombia  Brazil             2          1 FIFA Worl… Barr… Colomb…
 6 2023-11-21 Paraguay  Colombia           0          1 FIFA Worl… Asun… Paragu…
 7 2023-12-16 Mexico    Colombia           2          3 Friendly   Los … United…
 8 2024-06-08 United S… Colombia           1          5 Friendly   Land… United…
 9 2024-06-15 Colombia  Bolivia            3          0 Friendly   Denv… United…
10 2024-06-24 Colombia  Paraguay           2          1 Copa Amér… Hous… United…
# ℹ 15 more rows
# ℹ 3 more variables: neutral <lgl>, id <int>, cup <chr>

Filtering data

home_scores <- scores %>% 
  select(id, score = home_score, team = home_team, adversary = away_team, date = date, cup) %>% 
  mutate(condition = ifelse(scores$neutral, "neutral", "home"))
  
away_scores <- scores %>% 
  select(id, score = away_score, team = away_team, adversary = home_team, date = date, cup) %>% 
  mutate(condition = ifelse(scores$neutral, "neutral", "away"))

total_scores <- union_all(home_scores, away_scores)

We also consider the ranking FIFA of the teams as covariate. For every game, select the up-to-date ranking

path0 <- "https://raw.githubusercontent.com/Dato-Futbol/fifa-ranking/master/ranking_fifa_historical.csv"
rank <- read_csv(url(path0), show_col_types = FALSE)

rank <- rank %>% 
  select(team, total_points, date) %>% 
  group_by(date) %>% 
  mutate(ranking = rank(-total_points))

Merging the data will be

scores_rank <- total_scores %>% 
  left_join(rank, by = join_by(adversary == team), relationship = "many-to-many") %>% 
  filter(date.x >= date.y) %>% 
  mutate(diff_date = date.x - date.y) %>% 
  group_by(id, team) %>% 
  slice_min(diff_date) %>% 
  select(id, score, team, adversary, condition, date_game = date.x, ranking_adv = ranking, date_rank = date.y, cup)

We are separating the data for both Euro and Copa America. For both cases, we will use as test data the games played after the quarter-finals. i.e, we will finish predicting the results of the semi-finals and the finals.

date_euro = "2024-07-09"
date_ca = "2024-07-09"

train_euro <- scores_rank %>% 
  filter(date_game < date_euro, cup == "Euro")

train_america <- scores_rank %>% 
  filter(date_game < date_ca, cup == "America")

test_euro <- scores_rank %>% 
  filter(date_game >= date_euro, cup == "Euro")

test_america <- scores_rank %>% 
  filter(date_game >= date_ca, cup == "America")

Fitting the model

To fit the model I would consider the scores are distributed as Poisson for simplicity, one better option is the Conway-Maxwell-Poisson as shown in (Florez, Guindani, and Vannucci 2024). However, we will use a simple model for computational reasons as we constantly update the model.

model_euro <- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_euro, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum"))
model_america <- stan_glm(score ~ 1 + team + adversary + condition + ranking_adv, data = train_america, family = poisson, iter = 5000, contrasts = list(team = "contr.sum", adversary = "contr.sum"))

Estimation

Euro

strength_att <- model_euro$coefficients[2:24]
strength_att <- c(strength_att, 0 - sum(strength_att))

strength_def <- model_euro$coefficients[25:47]
strength_def <- c(strength_def, 0 - sum(strength_def))

coeffs <- data.frame(team = euro_2024, strength_att, strength_def, row.names = c())

Now, we can plot the estimated strengths in the following way:

ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_point() +
  geom_label_repel(aes(label = team),
                            box.padding = 0.25,
                            point.padding = 0.5,
                            segment.color = "grey50") +
  xlab(expression("Better Attack" %->% "")) +
  ylab(expression("Better Defense" %->% "")) + 
  ggtitle("Strengths of every team") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

This graph provides an intuition into the strengths and abilities of every team. Teams located in the origin have an overall performance, and below the axis are worse than the averages’ team.

Copa America

Similarly for Copa america

n <- length(america_2024)
strength_att <- model_america$coefficients[2:n]
strength_att <- c(strength_att, 0 - sum(strength_att))

strength_def <- model_america$coefficients[(n+1):(2*n-1)]
strength_def <- c(strength_def, 0 - sum(strength_def))

coeffs <- data.frame(team = america_2024, strength_att, strength_def, row.names = c())

Now, we can plot the estimated strengths in the following way:

ggplot(coeffs, aes(x = strength_att, y = -strength_def)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_point() +
  geom_label_repel(aes(label = team),
                            box.padding = 0.25,
                            point.padding = 0.5,
                            segment.color = "grey50") +
  xlab(expression("Better Attack" %->% "")) +
  ylab(expression("Better Defense" %->% "")) + 
  ggtitle("Strengths of every team") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

Predictions

Euro

Now, we predict the scores of the future games, i.e, the semifinals and the final.

test_pred <- test_euro %>% 
  select(id, team, adversary, condition, ranking_adv)

predictions <- posterior_predict(model_euro, newdata = test_pred) 

mean_goals <- apply(predictions, 2, mean)

test_pred <- test_pred %>% 
  ungroup() %>% 
  mutate(goals = mean_goals) %>% 
  group_by(id) %>% 
  summarise("Team 1" = first(team),
            "Exp. Goals" = first(goals),
            "Team 2" = last(team),
            "Exp. Goals 2" = last(goals)) 

The predictions of future games are:

test_pred %>% 
  print(n = 1e3)
# A tibble: 77 × 5
      id `Team 1`       `Exp. Goals` `Team 2`       `Exp. Goals 2`
   <int> <chr>                 <dbl> <chr>                   <dbl>
 1   103 France                0.393 Spain                   0.858
 2   104 England               1.90  Netherlands             1.60 
 3   105 England               0.632 Spain                   1.55 
 4   106 Croatia               1.55  Portugal                1.32 
 5   107 Poland                2.21  Scotland                0.928
 6   108 Denmark               0.727 Switzerland             0.935
 7   109 Serbia                0.356 Spain                   1.88 
 8   110 France                1.29  Italy                   0.662
 9   111 Austria               1.85  Slovenia                1.33 
10   112 Germany               2.19  Hungary                 0.879
11   113 Czech Republic        1.79  Georgia                 1.87 
12   114 Albania               1.38  Ukraine                 0.784
13   115 Portugal              2.33  Scotland                0.765
14   116 Croatia               1.94  Poland                  1.25 
15   117 Denmark               1.06  Serbia                  0.499
16   118 Spain                 1.25  Switzerland             0.645
17   119 Belgium               0.666 France                  0.851
18   120 Germany               2.02  Netherlands             1.99 
19   121 Albania               4.41  Georgia                 1.19 
20   122 Czech Republic        1.07  Ukraine                 1.38 
21   123 Belgium               1.12  Italy                   0.869
22   124 Hungary               1.71  Netherlands             2.47 
23   125 Albania               1.92  Czech Republic          0.851
24   126 Georgia               1.10  Ukraine                 1.95 
25   127 Croatia               3.61  Scotland                0.759
26   128 Poland                1.50  Portugal                1.16 
27   129 Serbia                0.512 Switzerland             1.22 
28   130 Denmark               0.428 Spain                   1.76 
29   131 Belgium               0.797 France                  0.710
30   132 Germany               2.41  Netherlands             1.67 
31   133 Albania               3.75  Georgia                 1.42 
32   134 Czech Republic        0.678 Ukraine                 1.07 
33   135 Portugal              2.14  Scotland                0.918
34   136 Croatia               1.64  Poland                  1.47 
35   137 Serbia                0.292 Spain                   2.30 
36   138 Denmark               0.619 Switzerland             1.09 
37   139 Belgium               1.33  Italy                   0.717
38   140 Poland                1.23  Portugal                1.38 
39   141 Croatia               3.06  Scotland                0.906
40   142 Denmark               0.507 Spain                   1.48 
41   143 Serbia                0.432 Switzerland             1.44 
42   144 Hungary               1.43  Netherlands             2.93 
43   145 Albania               2.25  Czech Republic          0.706
44   146 Georgia               1.69  Ukraine                 2.58 
45   147 France                1.10  Italy                   0.804
46   148 Austria               2.22  Slovenia                1.10 
47   149 Croatia               1.84  Portugal                1.12 
48   150 Poland                2.86  Scotland                0.768
49   151 Denmark               0.928 Serbia                  0.580
50   152 Spain                 1.50  Switzerland             0.536
51   153 Germany               1.85  Hungary                 1.05 
52   154 Albania               2.16  Ukraine                 1.05 
53   155 Czech Republic        2.19  Georgia                 1.53 
54   156 Croatia               1.06  France                  1.21 
55   157 Denmark               1.18  Portugal                0.831
56   158 Germany               1.46  Italy                   1.16 
57   159 Netherlands           1.06  Spain                   2.59 
58   160 Austria               2.70  Serbia                  0.652
59   161 Hungary               1.26  Turkey                  1.78 
60   162 Belgium               1.05  Ukraine                 0.515
61   163 Slovakia              0.898 Slovenia                1.36 
62   164 Albania               1.65  England                 2.12 
63   165 Croatia               0.884 France                  1.40 
64   166 Germany               1.70  Italy                   0.978
65   167 Denmark               0.978 Portugal                0.985
66   168 Netherlands           0.877 Spain                   3.10 
67   169 Belgium               1.61  Ukraine                 0.662
68   170 Hungary               1.48  Turkey                  1.51 
69   171 Austria               2.29  Serbia                  0.772
70   172 Slovakia              0.759 Slovenia                1.62 
71   173 Germany               1.86  Portugal                0.914
72   174 France                0.399 Spain                   0.866
73   175 Austria               3.40  Romania                 1.52 
74   176 Albania               1.88  Serbia                  0.62 
75   177 France                0.942 Germany                 1.01 
76   178 Portugal              0.376 Spain                   1.52 
77   179 Croatia               2.37  Czech Republic          0.804

Particullarly, the distribution of the prediction of the score for the final looks as shown below.

results_game <- tibble("England" = predictions[,5], "Spain" = predictions[,6])

df_count <- results_game %>% 
  group_by(England, Spain) %>% 
  summarise(count = n(), .groups = "keep") %>% 
  ungroup()

total_combinations <- nrow(results_game)

df_count <- df_count %>% 
    mutate(proportion = count/total_combinations)
  
ggplot(df_count, aes(x = England, y = Spain, fill = proportion)) +
  geom_tile() + 
  scale_fill_gradient(low = "white", high = "blue") + 
  labs(title = "Heatmap of Goals Scored by the two teams",
       x = "England", 
       y = "Spain",
       fill = "Probability") + 
  theme_minimal()

According to this, it is likely that the final will be Spain vs. England. Spain is expected to be the champion, with a predicted score of 2-0.

Copa America

Similarly, we can generate the predictions for the last round of the games in Copa América

test_pred <- test_america %>% 
  select(id, team, adversary, condition, ranking_adv)

predictions <- posterior_predict(model_america, newdata = test_pred) 

mean_goals <- apply(predictions, 2, mean)

test_pred <- test_pred %>% 
  ungroup() %>% 
  mutate(goals = mean_goals) %>% 
  group_by(id) %>% 
  summarise("Team 1" = first(team),
            "Exp. Goals" = first(goals),
            "Team 2" = last(team),
            "Exp. Goals 2" = last(goals)) 
test_pred %>% 
  print(n = 1e3)
# A tibble: 74 × 5
      id `Team 1`      `Exp. Goals` `Team 2`      `Exp. Goals 2`
   <int> <chr>                <dbl> <chr>                  <dbl>
 1    96 Argentina           1.93   Canada                0.433 
 2    97 Colombia            0.872  Uruguay               1.33  
 3    98 Canada              0.498  Uruguay               2.59  
 4    99 Argentina           0.989  Colombia              0.766 
 5   100 Bolivia             0.544  Venezuela             1.62  
 6   101 Argentina           1.08   Chile                 0.181 
 7   102 Colombia            1.27   Peru                  0.108 
 8   103 Paraguay            0.173  Uruguay               2.32  
 9   104 Brazil              1.60   Ecuador               0.728 
10   105 United States       2.10   United States         2.10  
11   106 Canada              1.04   Mexico                1.80  
12   107 Argentina           0.888  Colombia              0.952 
13   108 Ecuador             1.25   Peru                  0.0772
14   109 Bolivia             0.256  Chile                 1.14  
15   110 Uruguay             1.30   Venezuela             0.761 
16   111 Brazil              1.18   Paraguay              0.416 
17   112 Bolivia             0.499  Colombia              2.92  
18   113 Ecuador             1.56   Paraguay              0.285 
19   114 Argentina           0.899  Venezuela             0.609 
20   115 Brazil              0.654  Chile                 0.537 
21   116 Peru                0.0724 Uruguay               1.32  
22   117 United States       2.68   United States         2.68  
23   118 Canada              2.76   Panama                1.06  
24   119 United States       0.819  United States         0.819 
25   120 Chile               0.324  Colombia              1.20  
26   121 Paraguay            0.399  Venezuela             1.08  
27   122 Ecuador             0.46   Uruguay               1.98  
28   123 Argentina           3.58   Bolivia               0.170 
29   124 Brazil              1.60   Peru                  0.0665
30   125 Costa Rica          1.59   Panama                0.996 
31   126 United States       1.33   United States         1.33  
32   127 Brazil              0.938  Venezuela             1.22  
33   128 Bolivia             0.359  Ecuador               2.72  
34   129 Argentina           1.29   Paraguay              0.208 
35   130 Chile               0.346  Peru                  0.0751
36   131 Colombia            0.747  Uruguay               1.78  
37   132 Costa Rica          1.02   Panama                1.53  
38   133 United States       2.04   United States         2.04  
39   134 Bolivia             0.706  Paraguay              0.659 
40   135 Colombia            1.92   Ecuador               0.676 
41   136 Argentina           1.50   Peru                  0.0372
42   137 Chile               0.509  Venezuela             0.502 
43   138 Brazil              0.951  Uruguay               1.21  
44   139 United States       0.990  United States         0.990 
45   140 United States       2.10   United States         2.10  
46   141 Chile               1.42   Panama                0.329 
47   142 Canada              1.05   Mexico                1.93  
48   143 United States       2.68   United States         2.68  
49   144 Chile               0.465  Paraguay              0.267 
50   145 Bolivia             0.355  Peru                  0.275 
51   146 Brazil              1.42   Colombia              1.16  
52   147 Ecuador             1.06   Venezuela             0.778 
53   148 Argentina           0.588  Uruguay               0.954 
54   149 United States       2.21   United States         2.21  
55   150 Mexico              2.31   Panama                0.739 
56   151 Bolivia             0.331  Uruguay               2.99  
57   152 Colombia            2.49   Paraguay              0.255 
58   153 Peru                0.0665 Venezuela             1.27  
59   154 Argentina           1.44   Brazil                0.502 
60   155 Chile               0.563  Ecuador               0.492 
61   156 Brazil              1.02   Ecuador               1.13  
62   157 Paraguay            0.260  Uruguay               1.71  
63   158 Argentina           0.611  Chile                 0.265 
64   159 Colombia            1.97   Peru                  0.068 
65   160 Bolivia             0.315  Venezuela             3.01  
66   161 Bolivia             0.335  Chile                 0.850 
67   162 Uruguay             1.85   Venezuela             0.486 
68   163 Ecuador             0.774  Peru                  0.116 
69   164 Argentina           1.36   Colombia              0.615 
70   165 Brazil              2.09   Paraguay              0.260 
71   166 Costa Rica          0.586  Mexico                1.78  
72   167 Jamaica             1.63   Panama                1.25  
73   168 United States       2.11   United States         2.11  
74   169 United States       1.25   United States         1.25  

In this case, Argentina is expected to easily defeat Canada, and Uruguay will likely beat Colombia. However, Colombia won its game 1-0, so the final will be Argentina vs. Colombia, while the 3rd and 4th place game will be played between Canada and Uruguay.

The final is anticipated to be a close match, with the most likely result being 0-0, as indicated below. However, Argentina is expected to have a slightly higher probability of scoring more goals than Colombia.

results_game <- tibble("Argentina" = predictions[,7], "Colombia" = predictions[,8])

df_count <- results_game %>% 
  group_by(Argentina, Colombia) %>% 
  summarise(count = n(), .groups = "keep") %>% 
  ungroup()

total_combinations <- nrow(results_game)

df_count <- df_count %>% 
    mutate(proportion = count/total_combinations)
  
ggplot(df_count, aes(x = Argentina, y = Colombia, fill = proportion)) +
  geom_tile() + 
  scale_fill_gradient(low = "white", high = "blue") + 
  labs(title = "Heatmap of Goals Scored by the two teams",
       x = "Argentina", 
       y = "Colombia",
       fill = "Probability") + 
  theme_minimal()

As we saw, the model accurately predicted the scores and captured the strengths of each team. It also produced reasonable and accurate results for the final games based on the quarter-final outcomes. Spain defeated England 2-1, while the match between Argentina and Colombia ended 0-0, with Argentina scoring a goal in extra time.

For future models, incorporating dynamic strengths could improve accuracy.

References

Florez, Mauro, Michele Guindani, and Marina Vannucci. 2024. “Bayesian Bivariate ConwayMaxwellPoisson Regression Model for Correlated Count Data in Sports.” Journal of Quantitative Analysis in Sports 0 (0). https://doi.org/10.1515/jqas-2024-0072.