Jenny Bryan’s lectures from STAT545 at UBC: Introduction to dplyr
Software Carpentry’s R for reproducible scientific analysis materials: Dataframe manipulation with dplyr
Last class, we learned how to use dplyr
functions
filter()
for subsetting data with row logicselect()
for subsetting data variable- or
column-wise|>
) to implement function chainsToday, we’ll expand our data wrangling toolbox. By the end of today’s class, you should be able to:
dplyr
functions:
mutate()
arrange()
summarize()
and group_by()
Acknowledgements: Today’s lecture is adapted (with permission) from the excellent Ocean Health Index Data Science Training with additional input from Jenny Bryan’s lectures from STAT545 at UBC: Introduction to dplyr and Grolemund and Wickham’s R for Data Science.
select()
and
filter()
commandsSubset the coronavirus dataset to only include the daily counts of confirmed cases in countries located above 60 degree latitude. What are those countries?
If you have time, pipe it into ggplot() to visualize the trends over time in these countries.
# One way to do this:
coronavirus |>
filter(lat > 60, type == "confirmed") |>
select(country) |>
table()
coronavirus |>
filter(lat > 60, type == "confirmed") |>
ggplot() +
geom_line(mapping = aes(x = date, y = cases, color = country))
# If you try to plot the counts by date, you may see a very rugged pattern for Canada because case counts here are reported for different provinces. To get total daily counts by country we'll have to first group_by country and date and then summarize the counts, as outlined below.
mutate()
adds new variablesAlright, let’s keep going.
Besides selecting sets of existing columns, it’s often useful to add
new columns that are functions of existing columns. That’s the job of
mutate()
.
Visually, we are doing this (image from an earlier version of RStudio’s dplyr cheatsheet):
The current variables in the coronavirus dataset don’t lend
themselves well to cross-computation, so to illustrate the power of the
mutate()
function, let’s look at the companion dataset from
the coronavirus
package on global vaccination statistics.
Details about the dataset can be found on the README file here, and the raw
.csv dataset is available here https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/covid19_vaccine.csv
(you can also navigate to it from the cvs
subdirectory and
click View raw
to get the URL)
Let’s first load in the data the same way we did for the case counts dataset
vacc <- read_csv("https://raw.githubusercontent.com/RamiKrispin/coronavirus/main/csv/covid19_vaccine.csv")
## Rows: 142597 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): country_region, continent_name, continent_code, combined_key, iso2...
## dbl (7): doses_admin, people_at_least_one_dose, population, uid, code3, lat...
## lgl (1): fips
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Let's have a look at this
View(vacc) # remember to run this only from the console or an R script. Don't put this command in an .Rmd file that you're going to knit
This dataset has statistics on the cumulative number of vaccine doses administered and number of people vaccinated on each day in different countries. For now, let’s just look at the most recent day included in the dataset 2023-03-09.
Imagine we want to compare the vaccination rate across countries. We
can divide the people_at_least_one_dose
column with the
population
column to create a new column named
vaxxrate
. We do this with mutate()
that is a
function that defines and inserts new variables into a tibble. You can
refer to existing variables diretly by name (i.e. without the
$
operator). To make it easier to see what happens, we’ll
also select just a subset of the variables to retain from our
dataset.
vacc |>
filter(date == max(date)) |>
select(country_region, continent_name, people_at_least_one_dose, population) |>
mutate(vaxxrate = people_at_least_one_dose / population)
# We can modify the mutate equation in many ways. For example, if we want to adjust the number of significant digits printed, we can type
vacc |>
filter(date == max(date)) |>
select(country_region, continent_name, people_at_least_one_dose, population) |>
mutate(vaxxrate = round(people_at_least_one_dose / population, 2))
Add a new variable that shows how many doses of vaccine on average have been distributed per person considered fully vaccinated for each country. First look the patterns across all countries, then only show data for countries that have distributed more than 200 million doses. How many countries is that?
# To look at patterns across all countries
vacc |>
filter(date == max(date)) |>
select(country_region, continent_name, doses_admin, people_at_least_one_dose, population) |>
mutate(doses_per_vaxxed = doses_admin / people_at_least_one_dose)
# We can scroll through the output and eyeball patterns, but we can also pipe the output directly into ggplot!
vacc |>
filter(date == max(date)) |>
select(country_region, continent_name, doses_admin, people_at_least_one_dose, population) |>
mutate(doses_per_vaxxed = doses_admin / people_at_least_one_dose) |>
ggplot() +
geom_histogram(mapping = aes(x = doses_per_vaxxed))
# To subset the output, we can add another filter step
vacc |>
filter(date == max(date)) |>
select(country_region, continent_name, doses_admin, people_at_least_one_dose, population) |>
mutate(doses_per_vaxxed = doses_admin / people_at_least_one_dose) |>
filter(doses_admin > 200 * 10^6)
# We can also use this to identify outliers in our histogram above
vacc |>
filter(date == max(date)) |>
select(country_region, continent_name, doses_admin, people_at_least_one_dose, population) |>
mutate(doses_per_vaxxed = doses_admin / people_at_least_one_dose) |>
filter(doses_per_vaxxed > 3)
When you’re done, sync your RMarkdown file to Github.com (pull, stage, commit, push).
arrange()
orders rowsFor examining the output of our previous calculations, we may want to
re-arrange the countries in ascending or descending order for
vaccination rate. The dplyr
function for sorting rows is
arrange()
.
vacc |>
filter(date == max(vacc$date)) |>
select(country_region, continent_name, doses_admin, people_at_least_one_dose, population) |>
mutate(doses_per_vaxxed = doses_admin / people_at_least_one_dose) |>
arrange(doses_per_vaxxed)
# or
vacc |>
filter(date == max(vacc$date)) |>
select(country_region, continent_name, doses_admin, people_at_least_one_dose, population) |>
mutate(doses_per_vaxxed = doses_admin / people_at_least_one_dose) |>
arrange(-doses_per_vaxxed)
I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.
In how many countries do >90% of the population have at least one dose and which five countries have the highest vaccination rates (proportion of their population given at least one dose), according to this dataset?
vacc |>
filter(date == max(vacc$date)) |>
select(country_region, continent_name, people_at_least_one_dose, population) |>
mutate(vaxxrate = round(people_at_least_one_dose / population, 2)) |>
arrange(-vaxxrate) |>
filter(vaxxrate > 0.9)
- Go back to our original dataset
coronavirus
and identify where and when the highest death count in a single day was observed. Hint: you can either use orbase::max
ordplyr::arrange()
.
- The first case was confirmed in the US on January 20 2020, two days before the earliest day included in this dataset. When was the first confirmed case recorded in Canada?
# Identifying the record with the highest death count
coronavirus |>
filter(type == "death") |>
arrange(-cases)
# We can also just identify the top hit
coronavirus |>
filter(type == "death") |>
filter(cases == max(cases))
# The first recorded case in Canada
coronavirus |>
filter(country == "Canada", cases > 0) |>
arrange(date)
Knit your RMarkdown file, and sync it to GitHub (pull, stage, commit, push)
summarize()
and
group_by
The last key dplyr
verb is summarize()
. It
collapses a data frame to a single row. Visually, we are doing this
(image from an earlier version of RStudio’s dplyr cheatsheet):
We can use it to calculate the total number of confirmed cases detected globally since 1-22-2020 (the beginning of this dataset)
coronavirus |>
filter(type == "confirmed") |>
summarize(sum = sum(cases))
This number could also easily have been computed with base-R
functions. In general, summarize()
is not terribly useful
unless we pair it with group_by()
. This changes the unit of
analysis from the complete dataset to individual groups. Then, when you
use the dplyr
verbs on a grouped data frame they’ll be
automatically applied “by group”. For example, if we applied exactly the
same code to a data frame grouped by country, we get the total number of
confirmed cases for each country or region.
coronavirus |>
filter(type == "confirmed") |>
group_by(country) |>
summarize(total_cases = sum(cases))
Now that’s a lot more useful!
We can also use summarize()
to check how many
observations (dates) we have for each country
coronavirus |>
filter(type == "confirmed") |>
group_by(country) |>
summarize(n = n())
Why do some countries have much higher counts than others?
We can also do multi-level grouping. If we wanted to know how many of each type of case there were globally on Jan 1, 2023, we could chain these functions together:
coronavirus |>
group_by(date, type) |>
summarize(total = sum(cases)) |> # sums the count across countries
filter(date == "2023-01-01")
Which day has had the highest total death count globally reported in this dataset?
Pipe your global daily death counts into ggplot to visualize the trend over time.
coronavirus |>
filter(type == "death") |>
group_by(date) |>
summarize(total_deaths = sum(cases)) |>
arrange(-total_deaths)
# Or
coronavirus |>
filter(type == "death") |>
group_by(date) |>
summarize(total_deaths = sum(cases)) |>
filter(total_deaths == max(total_deaths))
# With plotting
coronavirus |>
filter(type == "death") |>
group_by(date) |>
summarize(total_deaths = sum(cases)) |>
arrange(-total_deaths) |>
ggplot() +
geom_line(aes(x = date, y = total_deaths))
## We see that the last day of the dataset has a much higher reported count than all other dates, so let's remove that
coronavirus |>
filter(type == "death", date != "2023-01-04") |>
group_by(date) |>
summarize(total_deaths = sum(cases)) |>
arrange(-total_deaths) |>
ggplot() +
geom_line(aes(x = date, y = total_deaths))
The month()
function from the package
lubridate
extracts the month from a date and
year()
extracts the year. How many countries already had
more than 10,000 deaths in February of 2021?
library(lubridate) #install.packages('lubridate')
coronavirus |>
mutate(month = month(date), year = year(date)) |>
filter(type == "death", month == 2, year == 2021) |>
group_by(country) |>
summarize(total_death = sum(cases)) |>
filter(total_death > 10000)
coronavirus |>
select(-lat, -long) |>
filter(date == "2020-10-04", type == "death") |>
arrange(-cases)
month()
and year()
from the package lubridate
might be helpful]. What about in March?library(lubridate) #install.packages('lubridate')
coronavirus |>
mutate(month = month(date), year = year(date)) |>
filter(type == "confirmed", month == 1, year == 2021) |>
group_by(country) |>
summarize(total_death = sum(cases)) |>
arrange(-total_death)
If you’re used to working in base R, answer the same question
with base R tools. Which coding approach do you like better or what are
pros and cons of the two types of syntax?
coronavirus |>
group_by(country, date) |>
summarize(n = n()) |>
group_by(country) |>
summarize(maxcount = max(n)) |>
filter(maxcount > 3)