1.1 Where and how can we get the data?
1.1.1 Data retrieval
Generally, we would like to know who (country, continent) hosted when. Since hosting is a lengthy process that starts by bidding and followed by FIFA evaluation. it would be interesting to incorporate bidding data into the analysis.
In this project we will use the data made available in this Wikipedia article about FIFA World Cup hosts
To do that, we are going to use the rvest package to explore and scrape this tables directly into R.
# URL of the article
<- "https://en.wikipedia.org/wiki/FIFA_World_Cup_hosts"
url # Read the webpage and obtain the pieces of the article containing tables
<- url %>%
tbls_lst %>%
read_html html_table()
#number of retrieved tables
length(tbls_lst)
## [1] 15
We’ve scrapped the Wikipedia article and parsed all the tables, 15 in total! hmm, we don’t need all of them for our analysis. Let’s select only the tables of interest for this tutorial. We’ll limit our data analysis to the subset of tables showing the list of countries that have submitted a bid or actually hosted the world cup and the performance of host countries in our analysis.
# Select tables of interest
<- tbls_lst[c(1,9,10)]
tbls_lst
# Assign names to the tables
<- c("List of hosts",
tables_names "Total bids by country",
"Host country performances")
names(tbls_lst) <- tolower(tables_names) %>% str_replace_all(" ","_")
Let’s have a quick look at the top of the selected tables
::gt(head(tbls_lst$list_of_hosts)) gt
Year | Host nation(s) | Continent |
---|---|---|
1930 | Uruguay | South America |
1934 | Italy | Europe |
1938 | France | Europe |
1942 | Cancelled because of World War II | Cancelled because of World War II |
1946 | Cancelled because of World War II | Cancelled because of World War II |
1950 | Brazil | South America |
::gt(head(tbls_lst$total_bids_by_country)) gt
Country | Bids | Years | Times hosted |
---|---|---|---|
Germany | 8 | 1938, 1962,[a] 1966,[a]1974,[a]1982,[a]1990,[a]1998, 2006 | 2 |
Argentina | 5 | 1938, 1962, 1970, 1978, 2014 | 1 |
England | 5 | 1966, 1990, 1998, 2006, 2018 | 1 |
Italy | 5 | 1930, 1934, 1974, 1982, 1990 | 2 |
Mexico | 5 | 1970, 1978, 1986,[b] 2002, 2026[c] | 3 |
Morocco | 5 | 1994, 1998, 2006, 2010, 2026 | 0 |
::gt(head(tbls_lst$host_country_performances)) gt
Year | Team | Result | Note | Pld | W | D | L | GF | GA | GD |
---|---|---|---|---|---|---|---|---|---|---|
1930 | Uruguay | Champions | Best result, later equalled | 4 | 4 | 0 | 0 | 15 | 3 | +12 |
1934 | Italy | Champions | Best result, later equalled | 5 | 4 | 1 | 0 | 12 | 3 | +9 |
1938 | France | Quarter-finals | Best result, later improved | 2 | 1 | 0 | 1 | 4 | 4 | 0 |
1950 | Brazil | Runners-up | Best result, later improved | 6 | 4 | 1 | 1 | 22 | 6 | +16 |
1954 | Switzerland | Quarter-finals | Equalled best result | 4 | 2 | 0 | 2 | 11 | 11 | 0 |
1958 | Sweden | Runners-up | Best result | 6 | 4 | 1 | 1 | 12 | 7 | +5 |
Looks good! Next, we’ll have a deeper look at the data to insure that everything is in the right place.
1.1.2 Data quality control
Parsing data from Web is not a perfect process due to different formatting. We’ll start by having a visual inspection of the tables using the package visdat
and combine all the tables in a single plot.
#Visualize the content of the tables
<- lapply(tbls_lst,visdat::vis_dat)) (vis_dat_lst
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the visdat package.
## Please report the issue at <https://github.com/ropensci/visdat/issues>.
## $list_of_hosts
##
## $total_bids_by_country
##
## $host_country_performances
#Add title of the data to the plot
<- lapply(names(vis_dat_lst), function(dat_name){
(vis_dat_lst #subset the plot of by name
<- vis_dat_lst[[dat_name]]
vis #add title
+
vis labs(title = dat_name)
}))
## [[1]]
##
## [[2]]
##
## [[3]]
#combine all the tables in a single plot
%>%
vis_dat_lst ::wrap_plots() &
patchworktheme(legend.position = "bottom")
The plots reveal two issues. First, the type of the column “Years” in the second table is character! We’ll fix this later, but now let’s deal with the second issue. The third table shows missing data in many columns! Let’s have a deeper look on this table
::gt(tbls_lst[[3]]) gt
Year | Team | Result | Note | Pld | W | D | L | GF | GA | GD |
---|---|---|---|---|---|---|---|---|---|---|
1930 | Uruguay | Champions | Best result, later equalled | 4 | 4 | 0 | 0 | 15 | 3 | +12 |
1934 | Italy | Champions | Best result, later equalled | 5 | 4 | 1 | 0 | 12 | 3 | +9 |
1938 | France | Quarter-finals | Best result, later improved | 2 | 1 | 0 | 1 | 4 | 4 | 0 |
1950 | Brazil | Runners-up | Best result, later improved | 6 | 4 | 1 | 1 | 22 | 6 | +16 |
1954 | Switzerland | Quarter-finals | Equalled best result | 4 | 2 | 0 | 2 | 11 | 11 | 0 |
1958 | Sweden | Runners-up | Best result | 6 | 4 | 1 | 1 | 12 | 7 | +5 |
1962 | Chile | Third place | Best result | 6 | 4 | 0 | 2 | 10 | 8 | +2 |
1966 | England | Champions | Best result | 6 | 5 | 1 | 0 | 11 | 3 | +8 |
1970 | Mexico | Quarter-finals | Best result, later equalled (again as hosts) | 4 | 2 | 1 | 1 | 6 | 4 | +2 |
1974 | West Germany | Champions | Equalled best result, later equalled again | 7 | 6 | 0 | 1 | 13 | 4 | +9 |
1978 | Argentina | Champions | Best result, later equalled | 7 | 5 | 1 | 1 | 15 | 4 | +9 |
1982 | Spain | Second round (top 12) | 5 | 1 | 2 | 2 | 4 | 5 | −1 | |
1986 | Mexico | Quarter-finals | Equalled best result (previous time again as hosts) | 5 | 3 | 2 | 0 | 6 | 2 | +4 |
1990 | Italy | Third place | 7 | 6 | 1 | 0 | 10 | 2 | +8 | |
1994 | United States | Round of 16 | 4 | 1 | 1 | 2 | 3 | 4 | −1 | |
1998 | France | Champions | Best result, later equalled | 7 | 6 | 1 | 0 | 15 | 2 | +13 |
2002 | South Korea | Fourth place | Best result | 7 | 3 | 2 | 2 | 8 | 6 | +2 |
2002 | Japan | Round of 16 | Best result, later equalled | 4 | 2 | 1 | 1 | 5 | 3 | +2 |
2006 | Germany | Third place | 7 | 5 | 1 | 1 | 14 | 6 | +8 | |
2010 | South Africa | First round | Equalled best (and worst) result | 3 | 1 | 1 | 1 | 3 | 5 | −2 |
2014 | Brazil | Fourth place | 7 | 3 | 2 | 2 | 11 | 14 | −3 | |
2018 | Russia | Quarter-finals | Best result as independent nation | 5 | 2 | 2 | 1 | 11 | 7 | +4 |
2022 | Qatar | First round | Best result | 3 | 0 | 0 | 3 | 1 | 7 | -6 |
2026 | Canada | TBD | NA | NA | NA | NA | NA | NA | ||
2026 | Mexico | TBD | NA | NA | NA | NA | NA | NA | ||
2026 | United States | TBD | NA | NA | NA | NA | NA | NA |
Apparently, the missing values are yet-to-be-determined performances of the hosting countries in 2026! Therefore, this shouldn’t be a concern and is not a failure of parsing the table.
Before rushing to the analysis, let’s push the tables through a few rounds of quality control.
let’s start by cleaning column names by handling special characters, spaces, and applying a consistent format.
#names of columns before cleaning
lapply(tbls_lst, colnames)
## $list_of_hosts
## [1] "Year" "Host nation(s)" "Continent"
##
## $total_bids_by_country
## [1] "Country" "Bids" "Years" "Times hosted"
##
## $host_country_performances
## [1] "Year" "Team" "Result" "Note" "Pld" "W" "D" "L"
## [9] "GF" "GA" "GD"
# Clean columns' names
<- lapply(tbls_lst, janitor::clean_names)
tbls_lst #names of columns after cleaning
lapply(tbls_lst, colnames)
## $list_of_hosts
## [1] "year" "host_nation_s" "continent"
##
## $total_bids_by_country
## [1] "country" "bids" "years" "times_hosted"
##
## $host_country_performances
## [1] "year" "team" "result" "note" "pld" "w" "d" "l"
## [9] "gf" "ga" "gd"
One can see that the column with the countries has a different name (“host_nation_s”, “country”, “team”) in each table. Let’s fix this inconsistency and set an new name ( “country_name”) to all of them.
#old inconsistent names
<- c("country", "team", "host_nation_s")
cols_old #new column name
<- "country_name"
col_new #apply the replacement
<- lapply(tbls_lst, function(tbl){
(tbls_lst %>%
tbl rename_with(~ ifelse(.x %in% cols_old,
col_new,
.x)) }))
## $list_of_hosts
## # A tibble: 25 × 3
## year country_name continent
## <int> <chr> <chr>
## 1 1930 Uruguay South America
## 2 1934 Italy Europe
## 3 1938 France Europe
## 4 1942 Cancelled because of World War II Cancelled because of World War II
## 5 1946 Cancelled because of World War II Cancelled because of World War II
## 6 1950 Brazil South America
## 7 1954 Switzerland Europe
## 8 1958 Sweden Europe
## 9 1962 Chile South America
## 10 1966 England Europe
## # … with 15 more rows
##
## $total_bids_by_country
## # A tibble: 35 × 4
## country_name bids years times…¹
## <chr> <int> <chr> <int>
## 1 Germany 8 1938, 1962,[a] 1966,[a]1974,[a]1982,[a]1990,[a]1… 2
## 2 Argentina 5 1938, 1962, 1970, 1978, 2014 1
## 3 England 5 1966, 1990, 1998, 2006, 2018 1
## 4 Italy 5 1930, 1934, 1974, 1982, 1990 2
## 5 Mexico 5 1970, 1978, 1986,[b] 2002, 2026[c] 3
## 6 Morocco 5 1994, 1998, 2006, 2010, 2026 0
## 7 Spain 5 1930, 1966, 1974, 1982, 2018[d] 1
## 8 Brazil 4 1950, 1994, 2006, 2014 2
## 9 Colombia 4 1970, 1978, 1986,[b]2014 1
## 10 United States 4 1986, 1994, 2022, 2026[c] 2
## # … with 25 more rows, and abbreviated variable name ¹times_hosted
##
## $host_country_performances
## # A tibble: 26 × 11
## year country_name result note pld w d l gf ga gd
## <int> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr>
## 1 1930 Uruguay Champions Best… 4 4 0 0 15 3 +12
## 2 1934 Italy Champions Best… 5 4 1 0 12 3 +9
## 3 1938 France Quarter-f… Best… 2 1 0 1 4 4 0
## 4 1950 Brazil Runners-up Best… 6 4 1 1 22 6 +16
## 5 1954 Switzerland Quarter-f… Equa… 4 2 0 2 11 11 0
## 6 1958 Sweden Runners-up Best… 6 4 1 1 12 7 +5
## 7 1962 Chile Third pla… Best… 6 4 0 2 10 8 +2
## 8 1966 England Champions Best… 6 5 1 0 11 3 +8
## 9 1970 Mexico Quarter-f… Best… 4 2 1 1 6 4 +2
## 10 1974 West Germany Champions Equa… 7 6 0 1 13 4 +9
## # … with 16 more rows
Similarly, the year column is called “years” in the second table. Let’s make it consistent with the other tables and rename it to “year”.
$total_bids_by_country <- tbls_lst$total_bids_by_country %>%
tbls_lst::rename(year = "years") dplyr
Next, we need to insure that the data is “tidy”. Obviously, this is not the case for the table below where the column column “years” show mutliple dates concatenated in the same row
$total_bids_by_country tbls_lst
## # A tibble: 35 × 4
## country_name bids year times…¹
## <chr> <int> <chr> <int>
## 1 Germany 8 1938, 1962,[a] 1966,[a]1974,[a]1982,[a]1990,[a]1… 2
## 2 Argentina 5 1938, 1962, 1970, 1978, 2014 1
## 3 England 5 1966, 1990, 1998, 2006, 2018 1
## 4 Italy 5 1930, 1934, 1974, 1982, 1990 2
## 5 Mexico 5 1970, 1978, 1986,[b] 2002, 2026[c] 3
## 6 Morocco 5 1994, 1998, 2006, 2010, 2026 0
## 7 Spain 5 1930, 1966, 1974, 1982, 2018[d] 1
## 8 Brazil 4 1950, 1994, 2006, 2014 2
## 9 Colombia 4 1970, 1978, 1986,[b]2014 1
## 10 United States 4 1986, 1994, 2022, 2026[c] 2
## # … with 25 more rows, and abbreviated variable name ¹times_hosted
what we need to do is to split years of bids into separate entries and convert it to numeric
#separate concatenated years into separate rows
$total_bids_by_country <- tbls_lst$total_bids_by_country %>%
(tbls_lstmutate(year = str_extract_all(year, "[0-9]+")) %>%
unnest(year) %>%
mutate(year = as.numeric(year)))
## # A tibble: 90 × 4
## country_name bids year times_hosted
## <chr> <int> <dbl> <int>
## 1 Germany 8 1938 2
## 2 Germany 8 1962 2
## 3 Germany 8 1966 2
## 4 Germany 8 1974 2
## 5 Germany 8 1982 2
## 6 Germany 8 1990 2
## 7 Germany 8 1998 2
## 8 Germany 8 2006 2
## 9 Argentina 5 1938 1
## 10 Argentina 5 1962 1
## # … with 80 more rows
and do the same thing by splitting cohosts of the same world cup (e.g. “Japan South Korea”) into separate rows entries (“Japan”, “South Korea”).
#separate cohosting countries into separate entries
$list_of_hosts <- tbls_lst$list_of_hosts %>%
tbls_lstmutate(country_name = str_split(country_name, "\\s{2}")) %>%
unnest(country_name) %>%
::rename(host_year = "year") dplyr
Have a look on separate_rows()
for another way to achieve the same effect
Next, let’s give a meaningful order to the results of the teams.
#order of the results
<- c("Champions",
results_order "Runners-up",
"Third place",
"Fourth place",
"Quarter-finals",
"Round of 16",
"Second round",
"First round",
"TBD"
)#set the order
$host_country_performances <- tbls_lst$host_country_performances %>%
tbls_lstmutate(result = ifelse(result == "Second round (top 12)", "Second round", result),
result = factor(result, levels = results_order))
And we’ll end this part by defining a new column with country code (isoc2).
#get iso2 code of each country (flags for Yugoslavia and England are missing)
<- lapply(tbls_lst,function(tbl){
(tbls_lst $country_code <- countrycode::countrycode(tbl$country_name,
tbl"country.name",# the provided country label
"iso2c"# the country code
)
tbl }))
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Cancelled because of World War II, England
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: England, Yugoslavia
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: England
## $list_of_hosts
## # A tibble: 28 × 4
## host_year country_name continent count…¹
## <int> <chr> <chr> <chr>
## 1 1930 Uruguay South America UY
## 2 1934 Italy Europe IT
## 3 1938 France Europe FR
## 4 1942 Cancelled because of World War II Cancelled because of Wor… <NA>
## 5 1946 Cancelled because of World War II Cancelled because of Wor… <NA>
## 6 1950 Brazil South America BR
## 7 1954 Switzerland Europe CH
## 8 1958 Sweden Europe SE
## 9 1962 Chile South America CL
## 10 1966 England Europe <NA>
## # … with 18 more rows, and abbreviated variable name ¹country_code
##
## $total_bids_by_country
## # A tibble: 90 × 5
## country_name bids year times_hosted country_code
## <chr> <int> <dbl> <int> <chr>
## 1 Germany 8 1938 2 DE
## 2 Germany 8 1962 2 DE
## 3 Germany 8 1966 2 DE
## 4 Germany 8 1974 2 DE
## 5 Germany 8 1982 2 DE
## 6 Germany 8 1990 2 DE
## 7 Germany 8 1998 2 DE
## 8 Germany 8 2006 2 DE
## 9 Argentina 5 1938 1 AR
## 10 Argentina 5 1962 1 AR
## # … with 80 more rows
##
## $host_country_performances
## # A tibble: 26 × 12
## year countr…¹ result note pld w d l gf ga gd count…²
## <int> <chr> <fct> <chr> <int> <int> <int> <int> <int> <int> <chr> <chr>
## 1 1930 Uruguay Champ… Best… 4 4 0 0 15 3 +12 UY
## 2 1934 Italy Champ… Best… 5 4 1 0 12 3 +9 IT
## 3 1938 France Quart… Best… 2 1 0 1 4 4 0 FR
## 4 1950 Brazil Runne… Best… 6 4 1 1 22 6 +16 BR
## 5 1954 Switzer… Quart… Equa… 4 2 0 2 11 11 0 CH
## 6 1958 Sweden Runne… Best… 6 4 1 1 12 7 +5 SE
## 7 1962 Chile Third… Best… 6 4 0 2 10 8 +2 CL
## 8 1966 England Champ… Best… 6 5 1 0 11 3 +8 <NA>
## 9 1970 Mexico Quart… Best… 4 2 1 1 6 4 +2 MX
## 10 1974 West Ge… Champ… Equa… 7 6 0 1 13 4 +9 DE
## # … with 16 more rows, and abbreviated variable names ¹country_name,
## # ²country_code
Quality control is not over yet! We need to manually apply some historical modification to the data.
First, let’s correct the entry of Colombia. After being chosen as a host in 1986, the country had to withdrew from hosting the cup due to economic concerns.
$total_bids_by_country <- tbls_lst$total_bids_by_country %>%
tbls_lstmutate(times_hosted = ifelse(country_name == "Colombia", 0, times_hosted))
Second, as Berlin Wall was brought to the ground few decades ago, let’s replace West Germany” with “Germany”.
<- lapply(tbls_lst, function(tbl){
tbls_lst %>%
tbl mutate(across(where(is.character), #select character columns
~ str_replace(.x, "West Germany", "Germany") #define replacement
)
) })
Now that the data is analysis-ready, it is time to explore some interesting questions!