Required:
Chapter 5 in in R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel & Garrett Grolemund
This cool Twitter thread by Julia Lowndes and Allison Horst. If you can’t access the thread or want more detail, the same material is expanded upon on Julie’s Openscapes website. Openscapes is an awesome organization that champions open practices in environmental science - check it out!
Additional resources:
spread()
and gather()
functions so
don’t get confused by the code.tidyr
vignette on
tidy dataAssignment 4 due tonight - submit by pushing to your class GitHub repo
We’re half way through the course! We will send out a link to a mid-term evaluation form (we’ll use our own form because the official CALS version won’t be ready in time to work with our schedule). We would appreciate your candid feedback so we can make the most of our remaining time together. All responses are completely anonymous.
Your GitHub PAT (personal access token) may expire soon (if you had set it to 30 days). We’ll review how to renew it at the end of class
So far, we’ve only worked with data that were already formatted for efficient processing with tidyverse functions. In this session we’ll learn some tools to help get data into that format - make it tidy and more coder-friendly.
By the end of today’s class, you should be able to:
tidyr::pivot_wider()
and
tidyr::pivot_longer()
to reshape data framestidyr::unite()
and tidyr::separate()
to merge or separate information from different columnsTodays lesson integrates material from multiple sources, including the excellent R for Excel users course by Julia Stewart Lowndes and Allison Horst and several other sources specified below.
Open the R Project associated with your personal class GitHub repository.
PULL to make sure your project is up to date
Create a new R script file and save it as
my_tidying.Rmd
Load the tidyverse
package
# Load packages
library(tidyverse)
“Tidy” might sound like a generic way to describe non-messy looking data, but it actually refers to a specific data structure.
A data set is tidy if:
See: Ch. 12 in R for Data Science by Grolemund & Wickham.
An implication of this definition is that each value belongs to exactly one variable and one observation. This also means that tidy data is relative, as it depends on how you define your observational unit and variables.
A key idea here is that instead of building your analyses around whatever (likely weird) format your data are in, take deliberate steps to make your data tidy. When your data are tidy, you can use a growing assortment of powerful analytical and visualization tools instead of inventing home-grown ways to accommodate your data. This will save you time since you aren’t reinventing the wheel, and will make your work more clear and understandable to your collaborators (most importantly, Future You).
Note that to effectively use ggplot()
your data must be
in tidy format. It also makes it easier to take advantage of R’s
vectorized nature (most built-in R functions work with vectors of
values)
Review of the beautiful slides by Julia Lowndes and Allison Horst for a clear overview of the motivation for working with tidy data.
Let’s go through some examples to get a better understanding of what tidy data look like.
Let’s first compare the four different representations of the same dataset shown in Chapter 12.2 in Grolemund and Wickham’s “R for Data Science” including the same values of four variables country, year, population, and cases.
Which of these representations are in tidy format?
The datasets discussed in Chapter 12 of for Data Science come with
the tidyverse
package, so you can access them by just
typing their names, e.g. table1
, table2
etc.
We computed the infection rate per year 10,000 people for each
year and each country from table1
together using
our now familiar dplyr
wrangling tools.
Your task: Now calculate this same statistic from
table2
and table4a
+ table4b
. You
will need to perform four operations:
Which representation is easiest to work with? Which is hardest? Why?
If I had one thing to tell biologists learning bioinformatics, it would be “write code for humans, write data for computers”. — Vince Buffalo (@vsbuffalo)
Now that the motivation for wanting to work with data in tidy format
hopefully is clear, let’s explore some powerful functions from the
package tidyr
for reshaping data. (tidyr
comes
bundled with the tidyverse
, so we don’t have to install it
separately).
Often, datasets will not be in tidy format because they are organized to facilitate some use other than analysis. For example, data is often organized to make entry or reading by humans as easy as possible.
This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems;
it’ll only suffer from both if you’re really unlucky! To fix these
problems, you’ll need the two most important functions in tidyr:
pivot_longer()
and pivot_wider()
.
We’ll walk through an illustration of how to use these by following Chapter 12.3 Pivoting in Grolemund and Wickham’s “R for Data Science”.
That overview uses a simple example with a small number of variables,
so we can easily list them individually when we use
pivot_longer()
. For datasets with more variables, we can
use more automated ways to index columns, the same helper functions we
used for the select()
function (you can refresh your memory
here). See
examples of this in the useful tidyr
vignette on pivoting.
After reviewing the pivot
functions, let’s continue on
to also take a look at Chapter
12.4 in R for Data Science on separate()
and
unite()
- two simple functions for splitting and combining
information from different columns.
To explore tidy data in a different context, let’s work through a
tutorial developed by Jenny Bryan using data on the Lord of the Rings
movies. This nicely illustrates the concepts of lengtening and widening
datasets. It uses outdated functions for pivoting the dataframes,
however, so we’ll work through updated code here (i.e. only look at the
01-intro.md
file, not the 02-gather.md
and
03-spread.md
- we’ll work through those steps here).
First let’s read the intro (01-intro.md
) here
Then let’s work through reshaping the data.
First, we bring the data into data frames or tibbles, one per film, and do some inspection.
fship <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")
ttow <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")
rking <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")
We now have one data frame per film, each with a common set of 4
variables. Step one in tidying this data is to glue them together into
one data frame, stacking them up row wise. This is called row binding
and we use dplyr::bind_rows()
.
lotr_untidy <- bind_rows(fship, ttow, rking)
lotr_untidy
## # A tibble: 9 × 4
## Film Race Female Male
## <chr> <chr> <dbl> <dbl>
## 1 The Fellowship Of The Ring Elf 1229 971
## 2 The Fellowship Of The Ring Hobbit 14 3644
## 3 The Fellowship Of The Ring Man 0 1995
## 4 The Two Towers Elf 331 513
## 5 The Two Towers Hobbit 0 2463
## 6 The Two Towers Man 401 3589
## 7 The Return Of The King Elf 183 510
## 8 The Return Of The King Hobbit 2 2673
## 9 The Return Of The King Man 268 2459
We are still violating one of the fundamental principles of
tidy data. “Word count” is a fundamental variable in
our dataset and it’s currently spread out over two variables,
Female
and Male
. Conceptually, we need to
gather up the word counts into a single variable and create a new
variable, Gender
, to track whether each count refers to
females or males. We use the pivot_longer()
function from
the tidyr package to do this.
lotr_tidy <-
pivot_longer(lotr_untidy, c(Male, Female), names_to = 'Gender', values_to = 'Words')
lotr_tidy
## # A tibble: 18 × 4
## Film Race Gender Words
## <chr> <chr> <chr> <dbl>
## 1 The Fellowship Of The Ring Elf Male 971
## 2 The Fellowship Of The Ring Elf Female 1229
## 3 The Fellowship Of The Ring Hobbit Male 3644
## 4 The Fellowship Of The Ring Hobbit Female 14
## 5 The Fellowship Of The Ring Man Male 1995
## 6 The Fellowship Of The Ring Man Female 0
## 7 The Two Towers Elf Male 513
## 8 The Two Towers Elf Female 331
## 9 The Two Towers Hobbit Male 2463
## 10 The Two Towers Hobbit Female 0
## 11 The Two Towers Man Male 3589
## 12 The Two Towers Man Female 401
## 13 The Return Of The King Elf Male 510
## 14 The Return Of The King Elf Female 183
## 15 The Return Of The King Hobbit Male 2673
## 16 The Return Of The King Hobbit Female 2
## 17 The Return Of The King Man Male 2459
## 18 The Return Of The King Man Female 268
Tidy data… mission accomplished!
To explain our call to pivot_longer()
above, let’s read
it from right to left: we took the variables Female
and
Male
and gathered their values into a single new variable
Words
. This forced the creation of a companion variable
Gender
, which tells whether a specific value of
Words
came from Female or Male. All other variables, such
as Film
, remain unchanged and are simply replicated as
needed.
Now we write this multi-film, tidy dataset to file for use in various downstream scripts for further analysis and visualization.
write_csv(lotr_tidy, file = "datasets/lotr_tidy.csv")
After tidying the data and completing your analysis, you may want
to output a table that has each race in its own column. Let’s use the
pivot_wider()
function to make such a table and save it as
“lotr_wide”
OPTIONAL: Use the pivot_longer()
function to
transform you lotr_wide back to tidy format.
# let's get one variable per Race
lotr_tidy |>
pivot_wider(names_from = Race, values_from = Words)
## # A tibble: 6 × 5
## Film Gender Elf Hobbit Man
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 The Fellowship Of The Ring Male 971 3644 1995
## 2 The Fellowship Of The Ring Female 1229 14 0
## 3 The Two Towers Male 513 2463 3589
## 4 The Two Towers Female 331 0 401
## 5 The Return Of The King Male 510 2673 2459
## 6 The Return Of The King Female 183 2 268
# let's get one variable per Gender
lotr_tidy |>
pivot_wider(names_from = Gender, values_from = Words)
## # A tibble: 9 × 4
## Film Race Male Female
## <chr> <chr> <dbl> <dbl>
## 1 The Fellowship Of The Ring Elf 971 1229
## 2 The Fellowship Of The Ring Hobbit 3644 14
## 3 The Fellowship Of The Ring Man 1995 0
## 4 The Two Towers Elf 513 331
## 5 The Two Towers Hobbit 2463 0
## 6 The Two Towers Man 3589 401
## 7 The Return Of The King Elf 510 183
## 8 The Return Of The King Hobbit 2673 2
## 9 The Return Of The King Man 2459 268
# let's get one variable per combo of Race and Gender
lotr_tidy |>
unite(Race_Gender, Race, Gender) |>
pivot_wider(names_from = Race_Gender, values_from = Words)
## # A tibble: 3 × 7
## Film Elf_Male Elf_Female Hobbit_Male Hobbit_Female Man_Male Man_Female
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 The Fellows… 971 1229 3644 14 1995 0
## 2 The Two Tow… 513 331 2463 0 3589 401
## 3 The Return … 510 183 2673 2 2459 268
The word count data is given in two untidy and gender-specific files available at these URLs:
https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/Female.csv
https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/Male.csv
Write an R script that reads them in and writes a single tidy data frame to file. Literally, reproduce the lotr_tidy data frame and the lotr_tidy.csv data file from above.
Write R code to compute the total number of words spoken by each race across the entire trilogy. Do it two ways:
Reflect on the process of writing this code and on the code itself. Which is easier to write? Easier to read?
Write R code to compute the total number of words spoken in each film. Do this by copying and modifying your own code for totalling words by race. Which approach is easier to modify and repurpose – the one based on multiple, untidy data frames or the tidy data?
When we configured git to connect with our GitHub account through RStudio, many of you may have used a PAT (personal access token) that was only valid for 30 days. Let’s revisit how we can set up with a new token by working through the procedures described in Lesson 2.