Readings


Required:

Chapter 13 in ‘R for Data Science’ by Hadley Wickham & Garrett Grolemund


Other resources:

Jenny Bryan’s STAT545 Chapter 14 When one tibble is not enough



Announcements



Plan for today and learning objectives

First, we’ll do a quick follow-up to Erika’s lecture on getting help last week. We’ll work through slides from Lecture 2 from ESM 206: Statistics and Data Analysis in Environmental Science and Management taught at UCSB by Allison Horst.


Then we’ll switch gears to talk about relational data. By the end of today’s class, you should be able to:



Setup

Load the tidyverse

library(tidyverse)


Today, we will be practicing joins on data on flights departing NYC in 2013. These data are compiled in a package that we will install and load

install.packages("nycflights13")
library(nycflights13)  # install.packages("nycflights13")



Relational data

From R for Data Science:

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

There are four main types of operations that can be done with two tables:

We will only cover the first three today. Let’s click on the links to work through the corresponding section in Jenny Bryan’s STAT 545 notes or Grolemund and Wickham’s R for Data Science.


Row and columns binding

We’ll play around with the Jenny Bryan’s Lord of the Rings Data that we also used in our Tidy Data lecture.

Let’s load it in

fship <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")

ttow <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")

rking <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")


# And try binding these separate dataframes together

lotr_untidy <- dplyr::bind_rows(fship, ttow, rking)
lotr_untidy <- base::rbind(fship, ttow, rking)


Then we’ll review Jenny Bryan’s overview of binding


Join functions

To practice the join functions, we’ll subset the flights dataframe as follows

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2


Then we’ll review Wickham and Grolemund’s overview of join functions summarized in these slides


Key point

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.



Optional exercises (from the R for Data Science chapter)


  1. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

    library(maps) #install.packages("maps")
    
    airports %>%
      semi_join(flights, c("faa" = "dest")) %>%
      ggplot(aes(lon, lat)) +
        borders("state") +
        geom_point() +
        coord_quickmap()

    (Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.)

    You might want to use the size or colour of the points to display the average delay for each airport.


  1. Add the location of the origin and destination (i.e. the lat and lon) to flights.


  1. Is there a relationship between the age of a plane and its delays?