Readings


Required:


Other resources:

Jenny Bryan’s STAT545 Chapter 14 When one tibble is not enough



Announcements

  • You will all have the opportunity to give a short presentation during our last class (required if you’re taking the course for credit). Details here



Plan for today and learning objectives

Today we’ll talk about relational data. By the end of today’s class, you should be able to:

  • Combine information from multiple tables into one
  • Describe the difference between the four join and two filter functions in dplyr
  • Select and apply the appropriate join function in common use scenarios



Setup

Load the tidyverse

library(tidyverse)


We will be practicing joins on data on flights departing NYC in 2013. These data are compiled in a package that we will install and load

install.packages("nycflights13")
library(nycflights13)  # install.packages("nycflights13")



Relational data

From R for Data Science:

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

There are four main types of operations that can be done with two tables:

  • Binding, which simply stacks tables on top of or beside each other

  • Mutating joins, which add new variables to one data frame from matching observations in another

  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table

  • Set operations, which treat observations as if they were set elements.

We will only cover the first three today. Let’s click on the links in turn to work through the corresponding section in either Jenny Bryan’s STAT 545 notes or Grolemund and Wickham’s R for Data Science.


Row and column binding

We’ll play around with the Jenny Bryan’s Lord of the Rings Data that we also used in our Tidy Data lecture.

Let’s load it in

fship <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")

ttow <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")

rking <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")


# And try binding these separate dataframes together

lotr_untidy <- dplyr::bind_rows(fship, ttow, rking)
lotr_untidy <- base::rbind(fship, ttow, rking)


Then we’ll review Jenny Bryan’s overview of binding


Join functions

To practice the join functions, we’ll subset the flights dataframe as follows

flights2 <- flights |> 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2


Then we’ll review the r4ds overview of join functions summarized in these slides


Key point

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.



Optional exercises (from the R for Data Science chapter)


  1. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
    library(maps) #install.packages("maps")

    airports |>
      semi_join(flights, c("faa" = "dest")) |>
      ggplot(aes(lon, lat)) +
        borders("state") +
        geom_point() +
        coord_quickmap()
(Don't worry if you don't understand what `semi_join()` does --- you'll
learn about it next.)

You might want to use the `size` or `colour` of the points to display
the average delay for each airport.


  1. Add the location of the origin and destination (i.e. the lat and lon) to flights.


  1. Is there a relationship between the age of a plane and its delays?