Chapter 13 in ‘R for Data Science’ by Hadley Wickham & Garrett Grolemund

Other resources:

Jenny Bryan’s STAT545 Chapter 14 When one tibble is not enough


  • Reminder that we will have short students presentations during our final class on April 14 (details here) - start thinking about what you want to show the class
  • Homework 6 is due tonight
  • Homework 7 will be posted later today

Plan for today and learning objectives

First, we’ll finish talking about data export (wrap up of lecture 11 from last week) and recap on how to save data and plots.

Then we’ll switch gears to talk about relational data. By the end of today’s class, you should be able to:

  • Combine information from multiple tables into one
  • Describe the difference between the four join and two filter functions in dplyr
  • Select and apply the appropriate join function in common use scenarios


Load the tidyverse


Today, we will be practicing joins on data on flights departing NYC in 2013. These data are compiled in a package that we will install and load

library(nycflights13)  # install.packages("nycflights13")

Relational data

From R for Data Science:

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

There are four main types of operations that can be done with two tables:

  • Binding, which simply stacks tables on top of or beside each other

  • Mutating joins, which add new variables to one data frame from matching observations in another

  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table

  • Set operations, which treat observations as if they were set elements.

We will only cover the first three today. Let’s click on the links in turn to work through the corresponding section in either Jenny Bryan’s STAT 545 notes or Grolemund and Wickham’s R for Data Science.

Row and columns binding

We’ll play around with the Jenny Bryan’s Lord of the Rings Data that we also used in our Tidy Data lecture.

Let’s load it in

fship <- read_csv("")

ttow <- read_csv("")

rking <- read_csv("")

# And try binding these separate dataframes together

lotr_untidy <- dplyr::bind_rows(fship, ttow, rking)
lotr_untidy <- base::rbind(fship, ttow, rking)

Then we’ll review Jenny Bryan’s overview of binding

Join functions

To practice the join functions, we’ll subset the flights dataframe as follows

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

Then we’ll review Wickham and Grolemund’s overview of join functions summarized in these slides

Key point

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

Optional exercises (from the R for Data Science chapter)

  1. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

    library(maps) #install.packages("maps")
    airports %>%
      semi_join(flights, c("faa" = "dest")) %>%
      ggplot(aes(lon, lat)) +
        borders("state") +
        geom_point() +

    (Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.)

    You might want to use the size or colour of the points to display the average delay for each airport.

  1. Add the location of the origin and destination (i.e. the lat and lon) to flights.

  1. Is there a relationship between the age of a plane and its delays?