Expand and Complete with tidyr

The tidyr package features some of the most useful data manipulation utilities in R. This post demonstrates expand() and complete(), which can be used to create data frames based on combinations of variables.

Data prep

First, we’ll prep some data for the example. The code below creates a tibble with the records for the UVA and Gonzaga men’s basketball programs for two recent seasons. Gonzaga has data for the 2016-2017 and 2017-2018 seasons and UVA has data for the 2017-2018 and 2018-2019. Each row includes wins and losses for the given team and season, aggregated to the opponent conference:

library(tidyr)
library(dplyr)

mcbb_teams <-
  tribble(~ team, ~ opp_conf, ~ season, ~ wins, ~ losses,
          "UVA", "ACC", "2018-2019", 17, 3,
          "UVA", "Colonial", "2018-2019", 2, 0,
          "UVA", "Atlantic 10", "2018-2019", 3, 0,
          "UVA", "MEAC", "2018-2019", 2, 0,
          "UVA", "Conference USA", "2018-2019", 2, 0,
          "UVA", "Big Ten", "2018-2019", 3, 0,
          "UVA", "SEC", "2018-2019", 2, 0,
          "UVA", "Big South", "2018-2019", 1, 0,
          "UVA", "Big 12", "2018-2019", 2, 0,
          "UVA", "Pac-12", "2018-2019", 1, 0,
          "UVA", "ACC", "2017-2018", 20, 1,
          "UVA", "Big 12", "2017-2018", 0, 1,
          "UVA", "Southern", "2017-2018", 1, 0,
          "UVA", "Ohio Valley", "2017-2018", 1, 0,
          "UVA", "MAAC", "2017-2018", 1, 0,
          "UVA", "MEAC", "2017-2018", 2, 0,
          "UVA", "American East", "2017-2018", 0, 1,
          "UVA", "Atlantic 10", "2017-2018", 3, 0,
          "UVA", "Patriot League", "2017-2018", 1, 0,
          "UVA", "SEC", "2017-2018", 1, 0,
          "UVA", "Big Ten", "2017-2018", 1, 0,
          "Gonzaga", "WCC", "2017-2018", 20, 1,
          "Gonzaga", "Southern", "2017-2018", 1, 0,
          "Gonzaga", "ACC", "2017-2018", 0, 1,
          "Gonzaga", "Big Ten", "2017-2018", 2, 0,
          "Gonzaga", "Mountain West", "2017-2018", 1, 1,
          "Gonzaga", "Big East", "2017-2018", 1, 1,
          "Gonzaga", "SWAC", "2017-2018", 1, 0,
          "Gonzaga", "MEAC", "2017-2018", 1, 0,
          "Gonzaga", "SEC", "2017-2018", 0, 1,
          "Gonzaga", "Big 12", "2017-2018", 1, 0,
          "Gonzaga", "Pac-12", "2017-2018", 1, 0,
          "Gonzaga", "Big Sky", "2017-2018", 1, 0,
          "Gonzaga", "Horizon", "2017-2018", 1, 0,
          "Gonzaga", "Southland", "2017-2018", 1, 0,
          "Gonzaga", "WCC", "2016-2017", 20, 1,
          "Gonzaga", "WAC", "2016-2017", 1, 0,
          "Gonzaga", "Mountain West", "2016-2017", 1, 0,
          "Gonzaga", "Northeast", "2016-2017", 1, 0,
          "Gonzaga", "MAAC", "2016-2017", 1, 0,
          "Gonzaga", "SWAC", "2016-2017", 1, 0,
          "Gonzaga", "SEC", "2016-2017", 3, 0,
          "Gonzaga", "Big 12", "2016-2017", 2, 0,
          "Gonzaga", "Pac-12", "2016-2017", 2, 0,
          "Gonzaga", "Mid-American Conference", "2016-2017", 1, 0,
          "Gonzaga", "Summit League", "2016-2017", 2, 0,
          "Gonzaga", "Big Ten", "2016-2017", 1, 0,
          "Gonzaga", "Big East", "2016-2017", 1, 0,
          "Gonzaga", "ACC", "2016-2017", 0, 1)
          
mcbb_teams
## # A tibble: 49 x 5
##    team  opp_conf       season     wins losses
##    <chr> <chr>          <chr>     <dbl>  <dbl>
##  1 UVA   ACC            2018-2019    17      3
##  2 UVA   Colonial       2018-2019     2      0
##  3 UVA   Atlantic 10    2018-2019     3      0
##  4 UVA   MEAC           2018-2019     2      0
##  5 UVA   Conference USA 2018-2019     2      0
##  6 UVA   Big Ten        2018-2019     3      0
##  7 UVA   SEC            2018-2019     2      0
##  8 UVA   Big South      2018-2019     1      0
##  9 UVA   Big 12         2018-2019     2      0
## 10 UVA   Pac-12         2018-2019     1      0
## # … with 39 more rows

Summarizing the records overall:

mcbb_teams %>%
  group_by(team, season) %>%
  summarise(wins = sum(wins),
            losses = sum(losses),
            .groups = "drop")
## # A tibble: 4 x 4
##   team    season     wins losses
##   <chr>   <chr>     <dbl>  <dbl>
## 1 Gonzaga 2016-2017    37      2
## 2 Gonzaga 2017-2018    32      5
## 3 UVA     2017-2018    31      3
## 4 UVA     2018-2019    35      3

expand()

The expand() function creates a tibble from combinations of input values.

In a simple incantation, we can expand on a single variable:

mcbb_teams %>%
  expand(team)
## # A tibble: 2 x 1
##   team   
##   <chr>  
## 1 Gonzaga
## 2 UVA

The “combinations” above are just unique values for the single column specified. A more useful scenario might be to specify multiple variables. By default the function will return all unique combinations of the values in those columns. We know there are 3 unique seasons and 2 unique teams in the example data, so we would expect 6 rows in the expanded data frame:

mcbb_teams %>%
  expand(team, season)
## # A tibble: 6 x 2
##   team    season   
##   <chr>   <chr>    
## 1 Gonzaga 2016-2017
## 2 Gonzaga 2017-2018
## 3 Gonzaga 2018-2019
## 4 UVA     2016-2017
## 5 UVA     2017-2018
## 6 UVA     2018-2019

Note that the two teams did not have data from the same seasons. UVA was missing data from 2016-2017 and Gonzaga was missing data from 2018-2019.

By default the expand() will create all possible combinations whether or not the combinations already appear in the data. To modify this behavior, you can use the nesting()1 helper to create only the combinations that are present:

mcbb_teams %>%
  expand(nesting(team, season))
## # A tibble: 4 x 2
##   team    season   
##   <chr>   <chr>    
## 1 Gonzaga 2016-2017
## 2 Gonzaga 2017-2018
## 3 UVA     2017-2018
## 4 UVA     2018-2019

crossing() is another helper. In this case, when used like nesting() above it will generate the same output as the original call to expand():

mcbb_teams %>%
  expand(crossing(team, season))
## # A tibble: 6 x 2
##   team    season   
##   <chr>   <chr>    
## 1 Gonzaga 2016-2017
## 2 Gonzaga 2017-2018
## 3 Gonzaga 2018-2019
## 4 UVA     2016-2017
## 5 UVA     2017-2018
## 6 UVA     2018-2019

The utility of crossing() is more obvious when used on its own:

tibble(team = c("UVA","UVA","Gonzaga")) %>%
  crossing(season = c("2016-2017","2017-2018"))
## # A tibble: 4 x 2
##   team    season   
##   <chr>   <chr>    
## 1 Gonzaga 2016-2017
## 2 Gonzaga 2017-2018
## 3 UVA     2016-2017
## 4 UVA     2017-2018

The ?tidyr::crossing help documentation notes that:

crossing() is a wrapper around expand_grid() that de-duplicates and sorts its inputs

So using expand_grid() with the same example above will produce all combinations with duplicates:

tibble(team = c("UVA","UVA","Gonzaga")) %>%
  expand_grid(season = c("2016-2017","2017-2018"))
## # A tibble: 6 x 2
##   team    season   
##   <chr>   <chr>    
## 1 UVA     2016-2017
## 2 UVA     2017-2018
## 3 UVA     2016-2017
## 4 UVA     2017-2018
## 5 Gonzaga 2016-2017
## 6 Gonzaga 2017-2018

Notably nesting() does not have an equivalent API when used outside of expand():

tibble(team = c("UVA","UVA","Gonzaga")) %>%
  nesting(season = c("2016-2017","2017-2018"))
## Error: Tibble columns must have compatible sizes.
## * Size 3: Existing data.
## * Size 2: Column `season`.
## ℹ Only values of size one are recycled.

When used inside expand(), the nesting and crossing behavior can be operate in conjunction with one another.

To demonstrate we’ll first expand on team, season, and opponent conference, nesting all columns:

mcbb_teams %>%
  expand(nesting(team, season, opp_conf))
## # A tibble: 49 x 3
##    team    season    opp_conf               
##    <chr>   <chr>     <chr>                  
##  1 Gonzaga 2016-2017 ACC                    
##  2 Gonzaga 2016-2017 Big 12                 
##  3 Gonzaga 2016-2017 Big East               
##  4 Gonzaga 2016-2017 Big Ten                
##  5 Gonzaga 2016-2017 MAAC                   
##  6 Gonzaga 2016-2017 Mid-American Conference
##  7 Gonzaga 2016-2017 Mountain West          
##  8 Gonzaga 2016-2017 Northeast              
##  9 Gonzaga 2016-2017 Pac-12                 
## 10 Gonzaga 2016-2017 SEC                    
## # … with 39 more rows

The original data was organized with one row per team, season, and opponent conference. So the result of the above should yield the same number of rows.

nrow(mcbb_teams)
## [1] 49

Alternatively we can be selective with how we implement nesting. For example, if we wanted all combinations of team and season that were recorded but with the opponent conferences that either team could have played:

mcbb_teams %>%
  expand(nesting(team, season), opp_conf)
## # A tibble: 104 x 3
##    team    season    opp_conf      
##    <chr>   <chr>     <chr>         
##  1 Gonzaga 2016-2017 ACC           
##  2 Gonzaga 2016-2017 American East 
##  3 Gonzaga 2016-2017 Atlantic 10   
##  4 Gonzaga 2016-2017 Big 12        
##  5 Gonzaga 2016-2017 Big East      
##  6 Gonzaga 2016-2017 Big Sky       
##  7 Gonzaga 2016-2017 Big South     
##  8 Gonzaga 2016-2017 Big Ten       
##  9 Gonzaga 2016-2017 Colonial      
## 10 Gonzaga 2016-2017 Conference USA
## # … with 94 more rows

And if we wanted all unique combinations of team, season, and opponent conference:

mcbb_teams %>%
  expand(team, season, opp_conf)
## # A tibble: 156 x 3
##    team    season    opp_conf      
##    <chr>   <chr>     <chr>         
##  1 Gonzaga 2016-2017 ACC           
##  2 Gonzaga 2016-2017 American East 
##  3 Gonzaga 2016-2017 Atlantic 10   
##  4 Gonzaga 2016-2017 Big 12        
##  5 Gonzaga 2016-2017 Big East      
##  6 Gonzaga 2016-2017 Big Sky       
##  7 Gonzaga 2016-2017 Big South     
##  8 Gonzaga 2016-2017 Big Ten       
##  9 Gonzaga 2016-2017 Colonial      
## 10 Gonzaga 2016-2017 Conference USA
## # … with 146 more rows

The expand operation can also accept vectors passed on-the-fly. For example, if we wanted all combinations of team and opponent conferences observed, along with an arbitrary range of seasons:

seasons <- c("2015-2016","2016-2017","2017-2018", "2018-2019", "2019-2020")

mcbb_teams %>%
  expand(nesting(team, opp_conf), season = seasons)
## # A tibble: 170 x 3
##    team    opp_conf season   
##    <chr>   <chr>    <chr>    
##  1 Gonzaga ACC      2015-2016
##  2 Gonzaga ACC      2016-2017
##  3 Gonzaga ACC      2017-2018
##  4 Gonzaga ACC      2018-2019
##  5 Gonzaga ACC      2019-2020
##  6 Gonzaga Big 12   2015-2016
##  7 Gonzaga Big 12   2016-2017
##  8 Gonzaga Big 12   2017-2018
##  9 Gonzaga Big 12   2018-2019
## 10 Gonzaga Big 12   2019-2020
## # … with 160 more rows

complete()

When expanding a data frame to all combinations, we lost the values that might have been of interest. In the example above, that would be wins and losses. We could retrieve those values by joining the original data to the expanded. But a much easier method is to use complete():

mcbb_teams %>%
  complete(team, season, opp_conf)
## # A tibble: 156 x 5
##    team    season    opp_conf        wins losses
##    <chr>   <chr>     <chr>          <dbl>  <dbl>
##  1 Gonzaga 2016-2017 ACC                0      1
##  2 Gonzaga 2016-2017 American East     NA     NA
##  3 Gonzaga 2016-2017 Atlantic 10       NA     NA
##  4 Gonzaga 2016-2017 Big 12             2      0
##  5 Gonzaga 2016-2017 Big East           1      0
##  6 Gonzaga 2016-2017 Big Sky           NA     NA
##  7 Gonzaga 2016-2017 Big South         NA     NA
##  8 Gonzaga 2016-2017 Big Ten            1      0
##  9 Gonzaga 2016-2017 Colonial          NA     NA
## 10 Gonzaga 2016-2017 Conference USA    NA     NA
## # … with 146 more rows

By default combinations that don’t appear in the original data will have NA values populated in the corresponding columns. You can customize this behavior with the “fill” argument, which accepts a named list containing the value to fill for each given column:

mcbb_teams %>%
  complete(team, season, opp_conf, fill=list(wins = "No wins recorded", 
                                             losses = "No losses recorded"))
## # A tibble: 156 x 5
##    team    season    opp_conf       wins             losses            
##    <chr>   <chr>     <chr>          <chr>            <chr>             
##  1 Gonzaga 2016-2017 ACC            0                1                 
##  2 Gonzaga 2016-2017 American East  No wins recorded No losses recorded
##  3 Gonzaga 2016-2017 Atlantic 10    No wins recorded No losses recorded
##  4 Gonzaga 2016-2017 Big 12         2                0                 
##  5 Gonzaga 2016-2017 Big East       1                0                 
##  6 Gonzaga 2016-2017 Big Sky        No wins recorded No losses recorded
##  7 Gonzaga 2016-2017 Big South      No wins recorded No losses recorded
##  8 Gonzaga 2016-2017 Big Ten        1                0                 
##  9 Gonzaga 2016-2017 Colonial       No wins recorded No losses recorded
## 10 Gonzaga 2016-2017 Conference USA No wins recorded No losses recorded
## # … with 146 more rows

complete() also leverages the nesting and crossing helpers:

mcbb_teams %>%
  complete(nesting(team, season), opp_conf)
## # A tibble: 104 x 5
##    team    season    opp_conf        wins losses
##    <chr>   <chr>     <chr>          <dbl>  <dbl>
##  1 Gonzaga 2016-2017 ACC                0      1
##  2 Gonzaga 2016-2017 American East     NA     NA
##  3 Gonzaga 2016-2017 Atlantic 10       NA     NA
##  4 Gonzaga 2016-2017 Big 12             2      0
##  5 Gonzaga 2016-2017 Big East           1      0
##  6 Gonzaga 2016-2017 Big Sky           NA     NA
##  7 Gonzaga 2016-2017 Big South         NA     NA
##  8 Gonzaga 2016-2017 Big Ten            1      0
##  9 Gonzaga 2016-2017 Colonial          NA     NA
## 10 Gonzaga 2016-2017 Conference USA    NA     NA
## # … with 94 more rows

Example

We can build on the demonstration above for a simple analysis. Let’s assume we want to compare the performance of the UVA and Gonzaga programs against their own conferences and the “Power 5” schools in the seasons observed:

## what are the power five conferences?
power_five <- c("ACC","Big Ten","Big 12","Pac-12","SEC")

mcbb_teams %>%
  ## create all combinations of team, season and opponent conference
  ## but only combinations of team/season that were observed
  complete(nesting(team,season), opp_conf) %>%
  ## need to separate season into year1,year2 for formatting team name
  separate(season, into = c("year1","year2"), sep = "-") %>%
  group_by(team) %>%
  ## format the team name with seasons observed
  mutate(team = paste0(team, " (", min(year1), "-", max(year2), ")")) %>%
  ## clean up
  select(-year1,-year2) %>%
  ## add up the number of wins/losses against each conference
  ## combining seasons observed
  group_by(team,opp_conf) %>%
  summarise(wins = sum(wins, na.rm = TRUE),
            losses = sum(losses, na.rm = TRUE),
            .groups = "drop") %>%
  ## format the 'record'
  mutate(record = paste0(wins, "-", losses)) %>%
  select(-wins,-losses) %>%
  ## restrict to power five AND Gonzaga's WCC
  filter(opp_conf %in% c(power_five,"WCC")) %>%
  ## reshape the data to wide format
  spread(opp_conf, record) %>%
  ## reorder columns
  select(team, ACC, WCC, everything()) %>%
  ## make pretty for html :)
  knitr::kable()
team ACC WCC Big 12 Big Ten Pac-12 SEC
Gonzaga (2016-2018) 0-2 40-2 3-0 3-0 3-0 3-1
UVA (2017-2019) 37-4 0-0 2-1 4-0 1-0 3-0

  1. Although they share similar names, the concept behind tidyr::nesting() is not the same as tidyr::nest().↩︎

Related