First, we’ll do a quick follow-up to Erika’s lecture on getting help last week. We’ll work through slides from Lecture 2 from ESM 206: Statistics and Data Analysis in Environmental Science and Management taught at UCSB by Allison Horst.
Then we’ll switch gears to talk about relational data. By the end of today’s class, you should be able to:
join
and two filter
functions in dplyr
Load the tidyverse
library(tidyverse)
Today, we will be practicing joins on data on flights departing NYC in 2013. These data are compiled in a package that we will install and load
install.packages("nycflights13")
library(nycflights13) # install.packages("nycflights13")
From R for Data Science:
It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
There are four main types of operations that can be done with two tables:
Binding, which simply stacks tables on top of or beside each other
Mutating joins, which add new variables to one data frame from matching observations in another
Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table
Set operations, which treat observations as if they were set elements.
We will only cover the first three today. Let’s click on the links to work through the corresponding section in Jenny Bryan’s STAT 545 notes or Grolemund and Wickham’s R for Data Science.
We’ll play around with the Jenny Bryan’s Lord of the Rings Data that we also used in our Tidy Data lecture.
Let’s load it in
fship <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")
ttow <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")
rking <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")
# And try binding these separate dataframes together
lotr_untidy <- dplyr::bind_rows(fship, ttow, rking)
lotr_untidy <- base::rbind(fship, ttow, rking)
Then we’ll review Jenny Bryan’s overview of binding
To practice the join
functions, we’ll subset the flights
dataframe as follows
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
Then we’ll review Wickham and Grolemund’s overview of join functions summarized in these slides
The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
Compute the average delay by destination, then join on the airports
data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
library(maps) #install.packages("maps")
airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()
(Don’t worry if you don’t understand what semi_join()
does — you’ll learn about it next.)
You might want to use the size
or colour
of the points to display the average delay for each airport.
lat
and lon
) to flights
.