Lesson 12: Relational data

Readings

Required:

Chapter 13 in ‘R for Data Science’ by Hadley Wickham & Garrett Grolemund

Other resources:

Jenny Bryan’s STAT545 Chapter 14 When one tibble is not enough

Announcements

Thanks for the mid-semester feedback
Reminder that we will have short students presentations during our final class on November 11 (details here) - start thinking about what you want to show the class
Homework 5 due tonight - it sounds like people have struggled with this one. As long as you demonstrate some effort, we won’t subtract credit if you don’t get all the way on each question
Homework 6 will be posted later today

Plan for today and learning objectives

First, we’ll do a quick follow-up to Erika’s lecture on getting help last week. We’ll work through slides from Lecture 2 from ESM 206: Statistics and Data Analysis in Environmental Science and Management taught at UCSB by Allison Horst.

Then we’ll switch gears to talk about relational data. By the end of today’s class, you should be able to:

Combine information from multiple tables into one
Describe the difference between the four join and two filter functions in dplyr
Select and apply the appropriate join function in common use scenarios

Setup

Load the tidyverse

library(tidyverse)

Today, we will be practicing joins on data on flights departing NYC in 2013. These data are compiled in a package that we will install and load

install.packages("nycflights13")
library(nycflights13)  # install.packages("nycflights13")

Relational data

From R for Data Science:

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

There are four main types of operations that can be done with two tables:

Binding, which simply stacks tables on top of or beside each other
Mutating joins, which add new variables to one data frame from matching observations in another
Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table
Set operations, which treat observations as if they were set elements.

We will only cover the first three today. Let’s click on the links to work through the corresponding section in Jenny Bryan’s STAT 545 notes or Grolemund and Wickham’s R for Data Science.

Row and columns binding

We’ll play around with the Jenny Bryan’s Lord of the Rings Data that we also used in our Tidy Data lecture.

Let’s load it in

fship <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")

ttow <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")

rking <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")


# And try binding these separate dataframes together

lotr_untidy <- dplyr::bind_rows(fship, ttow, rking)
lotr_untidy <- base::rbind(fship, ttow, rking)

Then we’ll review Jenny Bryan’s overview of binding

Join functions

To practice the join functions, we’ll subset the flights dataframe as follows

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

Then we’ll review Wickham and Grolemund’s overview of join functions summarized in these slides

Key point

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

Optional exercises (from the R for Data Science chapter)

Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
```
library(maps) #install.packages("maps")

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()
```
(Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.)

You might want to use the size or colour of the points to display the average delay for each airport.

Add the location of the origin and destination (i.e. the lat and lon) to flights.

Is there a relationship between the age of a plane and its delays?