1.1 Where and how can we get the data?

1.1.1 Data retrieval

Generally, we would like to know who (country, continent) hosted when. Since hosting is a lengthy process that starts by bidding and followed by FIFA evaluation. it would be interesting to incorporate bidding data into the analysis.

In this project we will use the data made available in this Wikipedia article about FIFA World Cup hosts

To do that, we are going to use the rvest package to explore and scrape this tables directly into R.

# URL of the article
url <- "https://en.wikipedia.org/wiki/FIFA_World_Cup_hosts"
# Read the webpage and obtain the pieces of the article containing tables
tbls_lst <-  url %>%
  read_html %>%
  html_table()
#number of retrieved tables
length(tbls_lst)
## [1] 15

We’ve scrapped the Wikipedia article and parsed all the tables, 15 in total! hmm, we don’t need all of them for our analysis. Let’s select only the tables of interest for this tutorial. We’ll limit our data analysis to the subset of tables showing the list of countries that have submitted a bid or actually hosted the world cup and the performance of host countries in our analysis.

# Select tables of interest
tbls_lst <- tbls_lst[c(1,9,10)]

# Assign names to the tables
tables_names <- c("List of hosts",
                  "Total bids by country",
                  "Host country performances")
names(tbls_lst) <- tolower(tables_names) %>% str_replace_all(" ","_")

Let’s have a quick look at the top of the selected tables

gt::gt(head(tbls_lst$list_of_hosts))
Year Host nation(s) Continent
1930 Uruguay South America
1934 Italy Europe
1938 France Europe
1942 Cancelled because of World War II Cancelled because of World War II
1946 Cancelled because of World War II Cancelled because of World War II
1950 Brazil South America
gt::gt(head(tbls_lst$total_bids_by_country))
Country Bids Years Times hosted
Germany 8 1938, 1962,[a] 1966,[a]1974,[a]1982,[a]1990,[a]1998, 2006 2
Argentina 5 1938, 1962, 1970, 1978, 2014 1
England 5 1966, 1990, 1998, 2006, 2018 1
Italy 5 1930, 1934, 1974, 1982, 1990 2
Mexico 5 1970, 1978, 1986,[b] 2002, 2026[c] 3
Morocco 5 1994, 1998, 2006, 2010, 2026 0
gt::gt(head(tbls_lst$host_country_performances))
Year Team Result Note Pld W D L GF GA GD
1930 Uruguay Champions Best result, later equalled 4 4 0 0 15 3 +12
1934 Italy Champions Best result, later equalled 5 4 1 0 12 3 +9
1938 France Quarter-finals Best result, later improved 2 1 0 1 4 4 0
1950 Brazil Runners-up Best result, later improved 6 4 1 1 22 6 +16
1954 Switzerland Quarter-finals Equalled best result 4 2 0 2 11 11 0
1958 Sweden Runners-up Best result 6 4 1 1 12 7 +5

Looks good! Next, we’ll have a deeper look at the data to insure that everything is in the right place.

1.1.2 Data quality control

Parsing data from Web is not a perfect process due to different formatting. We’ll start by having a visual inspection of the tables using the package visdat and combine all the tables in a single plot.

#Visualize the content of the tables
(vis_dat_lst <- lapply(tbls_lst,visdat::vis_dat))
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the visdat package.
##   Please report the issue at <https://github.com/ropensci/visdat/issues>.
## $list_of_hosts

## 
## $total_bids_by_country

## 
## $host_country_performances

#Add title of the data to the plot
(vis_dat_lst <- lapply(names(vis_dat_lst), function(dat_name){
  #subset the plot of by name
  vis <- vis_dat_lst[[dat_name]]
  #add title
  vis + 
    labs(title = dat_name)
  }))
## [[1]]

## 
## [[2]]

## 
## [[3]]

#combine all the tables in a single plot
vis_dat_lst %>%
  patchwork::wrap_plots() &
  theme(legend.position = "bottom")

The plots reveal two issues. First, the type of the column “Years” in the second table is character! We’ll fix this later, but now let’s deal with the second issue. The third table shows missing data in many columns! Let’s have a deeper look on this table

gt::gt(tbls_lst[[3]])
Year Team Result Note Pld W D L GF GA GD
1930 Uruguay Champions Best result, later equalled 4 4 0 0 15 3 +12
1934 Italy Champions Best result, later equalled 5 4 1 0 12 3 +9
1938 France Quarter-finals Best result, later improved 2 1 0 1 4 4 0
1950 Brazil Runners-up Best result, later improved 6 4 1 1 22 6 +16
1954 Switzerland Quarter-finals Equalled best result 4 2 0 2 11 11 0
1958 Sweden Runners-up Best result 6 4 1 1 12 7 +5
1962 Chile Third place Best result 6 4 0 2 10 8 +2
1966 England Champions Best result 6 5 1 0 11 3 +8
1970 Mexico Quarter-finals Best result, later equalled (again as hosts) 4 2 1 1 6 4 +2
1974 West Germany Champions Equalled best result, later equalled again 7 6 0 1 13 4 +9
1978 Argentina Champions Best result, later equalled 7 5 1 1 15 4 +9
1982 Spain Second round (top 12) 5 1 2 2 4 5 −1
1986 Mexico Quarter-finals Equalled best result (previous time again as hosts) 5 3 2 0 6 2 +4
1990 Italy Third place 7 6 1 0 10 2 +8
1994 United States Round of 16 4 1 1 2 3 4 −1
1998 France Champions Best result, later equalled 7 6 1 0 15 2 +13
2002 South Korea Fourth place Best result 7 3 2 2 8 6 +2
2002 Japan Round of 16 Best result, later equalled 4 2 1 1 5 3 +2
2006 Germany Third place 7 5 1 1 14 6 +8
2010 South Africa First round Equalled best (and worst) result 3 1 1 1 3 5 −2
2014 Brazil Fourth place 7 3 2 2 11 14 −3
2018 Russia Quarter-finals Best result as independent nation 5 2 2 1 11 7 +4
2022 Qatar First round Best result 3 0 0 3 1 7 -6
2026 Canada TBD NA NA NA NA NA NA
2026 Mexico TBD NA NA NA NA NA NA
2026 United States TBD NA NA NA NA NA NA

Apparently, the missing values are yet-to-be-determined performances of the hosting countries in 2026! Therefore, this shouldn’t be a concern and is not a failure of parsing the table.

Before rushing to the analysis, let’s push the tables through a few rounds of quality control.

let’s start by cleaning column names by handling special characters, spaces, and applying a consistent format.

#names of columns before cleaning
lapply(tbls_lst, colnames)
## $list_of_hosts
## [1] "Year"           "Host nation(s)" "Continent"     
## 
## $total_bids_by_country
## [1] "Country"       "Bids"          "Years"         "Times  hosted"
## 
## $host_country_performances
##  [1] "Year"   "Team"   "Result" "Note"   "Pld"    "W"      "D"      "L"     
##  [9] "GF"     "GA"     "GD"
# Clean columns' names
tbls_lst <- lapply(tbls_lst,  janitor::clean_names)
#names of columns after cleaning
lapply(tbls_lst, colnames)
## $list_of_hosts
## [1] "year"          "host_nation_s" "continent"    
## 
## $total_bids_by_country
## [1] "country"      "bids"         "years"        "times_hosted"
## 
## $host_country_performances
##  [1] "year"   "team"   "result" "note"   "pld"    "w"      "d"      "l"     
##  [9] "gf"     "ga"     "gd"

One can see that the column with the countries has a different name (“host_nation_s”, “country”, “team”) in each table. Let’s fix this inconsistency and set an new name ( “country_name”) to all of them.

#old inconsistent names
cols_old <- c("country", "team", "host_nation_s")
#new column name
col_new <- "country_name"
#apply the replacement
(tbls_lst <- lapply(tbls_lst, function(tbl){
  tbl %>%
    rename_with(~ ifelse(.x %in% cols_old,
                         col_new,
                         .x))
  }))
## $list_of_hosts
## # A tibble: 25 × 3
##     year country_name                      continent                        
##    <int> <chr>                             <chr>                            
##  1  1930 Uruguay                           South America                    
##  2  1934 Italy                             Europe                           
##  3  1938 France                            Europe                           
##  4  1942 Cancelled because of World War II Cancelled because of World War II
##  5  1946 Cancelled because of World War II Cancelled because of World War II
##  6  1950 Brazil                            South America                    
##  7  1954 Switzerland                       Europe                           
##  8  1958 Sweden                            Europe                           
##  9  1962 Chile                             South America                    
## 10  1966 England                           Europe                           
## # … with 15 more rows
## 
## $total_bids_by_country
## # A tibble: 35 × 4
##    country_name   bids years                                             times…¹
##    <chr>         <int> <chr>                                               <int>
##  1 Germany           8 1938, 1962,[a] 1966,[a]1974,[a]1982,[a]1990,[a]1…       2
##  2 Argentina         5 1938, 1962, 1970, 1978, 2014                            1
##  3 England           5 1966, 1990, 1998, 2006, 2018                            1
##  4 Italy             5 1930, 1934, 1974, 1982, 1990                            2
##  5 Mexico            5 1970, 1978, 1986,[b] 2002, 2026[c]                      3
##  6 Morocco           5 1994, 1998, 2006, 2010, 2026                            0
##  7 Spain             5 1930, 1966, 1974, 1982, 2018[d]                         1
##  8 Brazil            4 1950, 1994, 2006, 2014                                  2
##  9 Colombia          4 1970, 1978, 1986,[b]2014                                1
## 10 United States     4 1986, 1994, 2022, 2026[c]                               2
## # … with 25 more rows, and abbreviated variable name ¹​times_hosted
## 
## $host_country_performances
## # A tibble: 26 × 11
##     year country_name result     note    pld     w     d     l    gf    ga gd   
##    <int> <chr>        <chr>      <chr> <int> <int> <int> <int> <int> <int> <chr>
##  1  1930 Uruguay      Champions  Best…     4     4     0     0    15     3 +12  
##  2  1934 Italy        Champions  Best…     5     4     1     0    12     3 +9   
##  3  1938 France       Quarter-f… Best…     2     1     0     1     4     4 0    
##  4  1950 Brazil       Runners-up Best…     6     4     1     1    22     6 +16  
##  5  1954 Switzerland  Quarter-f… Equa…     4     2     0     2    11    11 0    
##  6  1958 Sweden       Runners-up Best…     6     4     1     1    12     7 +5   
##  7  1962 Chile        Third pla… Best…     6     4     0     2    10     8 +2   
##  8  1966 England      Champions  Best…     6     5     1     0    11     3 +8   
##  9  1970 Mexico       Quarter-f… Best…     4     2     1     1     6     4 +2   
## 10  1974 West Germany Champions  Equa…     7     6     0     1    13     4 +9   
## # … with 16 more rows

Similarly, the year column is called “years” in the second table. Let’s make it consistent with the other tables and rename it to “year”.

tbls_lst$total_bids_by_country <- tbls_lst$total_bids_by_country %>% 
  dplyr::rename(year = "years")

Next, we need to insure that the data is “tidy”. Obviously, this is not the case for the table below where the column column “years” show mutliple dates concatenated in the same row

tbls_lst$total_bids_by_country
## # A tibble: 35 × 4
##    country_name   bids year                                              times…¹
##    <chr>         <int> <chr>                                               <int>
##  1 Germany           8 1938, 1962,[a] 1966,[a]1974,[a]1982,[a]1990,[a]1…       2
##  2 Argentina         5 1938, 1962, 1970, 1978, 2014                            1
##  3 England           5 1966, 1990, 1998, 2006, 2018                            1
##  4 Italy             5 1930, 1934, 1974, 1982, 1990                            2
##  5 Mexico            5 1970, 1978, 1986,[b] 2002, 2026[c]                      3
##  6 Morocco           5 1994, 1998, 2006, 2010, 2026                            0
##  7 Spain             5 1930, 1966, 1974, 1982, 2018[d]                         1
##  8 Brazil            4 1950, 1994, 2006, 2014                                  2
##  9 Colombia          4 1970, 1978, 1986,[b]2014                                1
## 10 United States     4 1986, 1994, 2022, 2026[c]                               2
## # … with 25 more rows, and abbreviated variable name ¹​times_hosted

what we need to do is to split years of bids into separate entries and convert it to numeric

#separate concatenated years into separate rows
(tbls_lst$total_bids_by_country <- tbls_lst$total_bids_by_country %>% 
  mutate(year = str_extract_all(year, "[0-9]+")) %>% 
  unnest(year) %>% 
  mutate(year = as.numeric(year)))
## # A tibble: 90 × 4
##    country_name  bids  year times_hosted
##    <chr>        <int> <dbl>        <int>
##  1 Germany          8  1938            2
##  2 Germany          8  1962            2
##  3 Germany          8  1966            2
##  4 Germany          8  1974            2
##  5 Germany          8  1982            2
##  6 Germany          8  1990            2
##  7 Germany          8  1998            2
##  8 Germany          8  2006            2
##  9 Argentina        5  1938            1
## 10 Argentina        5  1962            1
## # … with 80 more rows

and do the same thing by splitting cohosts of the same world cup (e.g. “Japan South Korea”) into separate rows entries (“Japan”, “South Korea”).

#separate cohosting countries into separate entries
tbls_lst$list_of_hosts <- tbls_lst$list_of_hosts %>% 
  mutate(country_name = str_split(country_name, "\\s{2}")) %>%
  unnest(country_name) %>% 
  dplyr::rename(host_year = "year")

Have a look on separate_rows() for another way to achieve the same effect

Next, let’s give a meaningful order to the results of the teams.

#order of the results
results_order <- c("Champions",
                   "Runners-up",
                   "Third place",
                   "Fourth place",
                   "Quarter-finals",
                   "Round of 16",
                   "Second round",
                   "First round",
                   "TBD"
                   )
#set the order
tbls_lst$host_country_performances  <- tbls_lst$host_country_performances %>%
  mutate(result = ifelse(result == "Second round (top 12)", "Second round", result),
         result = factor(result, levels = results_order))

And we’ll end this part by defining a new column with country code (isoc2).

#get iso2 code of each country (flags for Yugoslavia and England are missing)
(tbls_lst <- lapply(tbls_lst,function(tbl){
  tbl$country_code <- countrycode::countrycode(tbl$country_name,
                                            "country.name",# the provided country label
                                            "iso2c"# the country code
                                            )
  tbl
  }))
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Cancelled because of World War II, England
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: England, Yugoslavia
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: England
## $list_of_hosts
## # A tibble: 28 × 4
##    host_year country_name                      continent                 count…¹
##        <int> <chr>                             <chr>                     <chr>  
##  1      1930 Uruguay                           South America             UY     
##  2      1934 Italy                             Europe                    IT     
##  3      1938 France                            Europe                    FR     
##  4      1942 Cancelled because of World War II Cancelled because of Wor… <NA>   
##  5      1946 Cancelled because of World War II Cancelled because of Wor… <NA>   
##  6      1950 Brazil                            South America             BR     
##  7      1954 Switzerland                       Europe                    CH     
##  8      1958 Sweden                            Europe                    SE     
##  9      1962 Chile                             South America             CL     
## 10      1966 England                           Europe                    <NA>   
## # … with 18 more rows, and abbreviated variable name ¹​country_code
## 
## $total_bids_by_country
## # A tibble: 90 × 5
##    country_name  bids  year times_hosted country_code
##    <chr>        <int> <dbl>        <int> <chr>       
##  1 Germany          8  1938            2 DE          
##  2 Germany          8  1962            2 DE          
##  3 Germany          8  1966            2 DE          
##  4 Germany          8  1974            2 DE          
##  5 Germany          8  1982            2 DE          
##  6 Germany          8  1990            2 DE          
##  7 Germany          8  1998            2 DE          
##  8 Germany          8  2006            2 DE          
##  9 Argentina        5  1938            1 AR          
## 10 Argentina        5  1962            1 AR          
## # … with 80 more rows
## 
## $host_country_performances
## # A tibble: 26 × 12
##     year countr…¹ result note    pld     w     d     l    gf    ga gd    count…²
##    <int> <chr>    <fct>  <chr> <int> <int> <int> <int> <int> <int> <chr> <chr>  
##  1  1930 Uruguay  Champ… Best…     4     4     0     0    15     3 +12   UY     
##  2  1934 Italy    Champ… Best…     5     4     1     0    12     3 +9    IT     
##  3  1938 France   Quart… Best…     2     1     0     1     4     4 0     FR     
##  4  1950 Brazil   Runne… Best…     6     4     1     1    22     6 +16   BR     
##  5  1954 Switzer… Quart… Equa…     4     2     0     2    11    11 0     CH     
##  6  1958 Sweden   Runne… Best…     6     4     1     1    12     7 +5    SE     
##  7  1962 Chile    Third… Best…     6     4     0     2    10     8 +2    CL     
##  8  1966 England  Champ… Best…     6     5     1     0    11     3 +8    <NA>   
##  9  1970 Mexico   Quart… Best…     4     2     1     1     6     4 +2    MX     
## 10  1974 West Ge… Champ… Equa…     7     6     0     1    13     4 +9    DE     
## # … with 16 more rows, and abbreviated variable names ¹​country_name,
## #   ²​country_code

Quality control is not over yet! We need to manually apply some historical modification to the data.

First, let’s correct the entry of Colombia. After being chosen as a host in 1986, the country had to withdrew from hosting the cup due to economic concerns.

tbls_lst$total_bids_by_country <- tbls_lst$total_bids_by_country %>% 
  mutate(times_hosted = ifelse(country_name == "Colombia", 0, times_hosted))

Second, as Berlin Wall was brought to the ground few decades ago, let’s replace West Germany” with “Germany”.

West Germany national team

tbls_lst <- lapply(tbls_lst, function(tbl){
  tbl %>%
    mutate(across(where(is.character), #select character columns
                  ~ str_replace(.x, "West Germany", "Germany") #define replacement
                  )
           )
  })

Now that the data is analysis-ready, it is time to explore some interesting questions!