Ch. 12 Tidy Data, in R for Data Science by Grolemund & Wickham
This cool Twitter thread by Julia Lowndes. If you can’t access the thread or want more detail, the same material is expanded upon on Julie’s Openscapes website. Openscapes is an awesome organization that champions open practices in environmental science - check it out!
gather()functions so don’t get confused by the code.
tidyr[vignette on tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
So far, we’ve only worked with data that were already formatted for efficient processing with tidyverse functions. In this session we’ll learn some tools to help get data into that format - make it tidy and more coder-friendly.
By the end of today’s class, you should be able to:
tidyr::pivot_longer()to reshape data frames
tidyr::separate()to merge or separate information from different columns
Todays lesson integrates material from multiple sources, including the excellent R for Excel users course by Julia Stewart Lowndes and Allison Horst and several other sources specified below.
Open the R Project associated with your personal class GitHub repository.
PULL to make sure your project is up to date
Create a new R script file and save it as
Load the packages we’ll use here (you will have to install
janitor if you don’t already have it):
# Load packages library(tidyverse) library(janitor) ## install.packages("janitor")
“Tidy” might sound like a generic way to describe non-messy looking data, but it actually refers to a specific data structure.
A data set is tidy if:
An implication of this definition is that each value belongs to exactly one variable and one observation. This also means that tidy data is relative, as it depends on how you define your observational unit and variables.
A key idea here is that instead of building your analyses around whatever (likely weird) format your data are in, take deliberate steps to make your data tidy. When your data are tidy, you can use a growing assortment of powerful analytical and visualization tools instead of inventing home-grown ways to accommodate your data. This will save you time since you aren’t reinventing the wheel, and will make your work more clear and understandable to your collaborators (most importantly, Future You).
Note that to effectively use
ggplot() your data must be in tidy format. It also makes it easier to take advantage of R’s vectorized nature (most built-in R functions work with vectors of values)
Review of the beautiful slides by Julia Lowndes and Allison Horst for a clear overview of the motivation for working with tidy data.
Let’s go through some examples to get a better understanding of what tidy data look like.
Let’s first compare the four different representations of the same dataset shown in Chapter 12.2 in Grolemund and Wickham’s “R for Data Science” including the same values of four variables country, year, population, and cases.
Which of these representations are in tidy format?
The datasets discussed in Chapter 12 of for Data Science come with the
tidyverse package, so you can access them by just typing their names, e.g.
table2 etc. We computed the infection rate per year 10,000 people for each year and each country from
table1 together using our now familiar
dplyr wrangling tools.
Your task: Now calculate this same statistic from
table4b. You will need to perform four operations:
Which representation is easiest to work with? Which is hardest? Why?
If I had one thing to tell biologists learning bioinformatics, it would be “write code for humans, write data for computers”. — Vince Buffalo (@vsbuffalo)
Now that the motivation for wanting to work with data in tidy format hopefully is clear, let’s explore some powerful functions from the package
tidyr for reshaping data. (
tidyr comes bundles with the
tidyverse, so we don’t have to install it separately).
Often, datasets will not be in tidy format because they are organized to facilitate some use other than analysis. For example, data is often organized to make entry or reading by humans as easy as possible.
This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it’ll only suffer from both if you’re really unlucky! To fix these problems, you’ll need the two most important functions in tidyr:
We’ll walk through an illustration of how to use these by following Chapter 12.3 Pivoting in Grolemund and Wickham’s “R for Data Science”.
That overview uses a simple example with a small number of variables, so we can easily list them individually when we use
pivot_longer(). For datasets with more variables, we can use more automated ways to index columns, the same helper functions we used for the
select() function (you can refresh your memory here). See examples of this in the useful
tidyr vignette on pivoting.
After reviewing the
pivot functions, let’s continue on to also take a look at Chapter 12.4 in R for Data Science on
unite() - two simple functions for splitting and combining information from different columns.
To explore tidy data in a different context, let’s work through a tutorial developed by Jenny Bryan using data on the Lord of the Rings movies. This nicely illustrates the concepts of lengtening and widening datasets. It uses outdated functions for pivoting the dataframes, however, so we’ll work through updated code here (i.e. only look at the
01-intro.md file, not the
03-spread.md - we’ll work through those steps here).
First let’s read the intro (
Then let’s work through reshaping the data.
First, we bring the data into data frames or tibbles, one per film, and do some inspection.
fship <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv") ttow <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv") rking <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")
We now have one data frame per film, each with a common set of 4 variables. Step one in tidying this data is to glue them together into one data frame, stacking them up row wise. This is called row binding and we use
lotr_untidy <- bind_rows(fship, ttow, rking) lotr_untidy
## # A tibble: 9 x 4 ## Film Race Female Male ## <chr> <chr> <dbl> <dbl> ## 1 The Fellowship Of The Ring Elf 1229 971 ## 2 The Fellowship Of The Ring Hobbit 14 3644 ## 3 The Fellowship Of The Ring Man 0 1995 ## 4 The Two Towers Elf 331 513 ## 5 The Two Towers Hobbit 0 2463 ## 6 The Two Towers Man 401 3589 ## 7 The Return Of The King Elf 183 510 ## 8 The Return Of The King Hobbit 2 2673 ## 9 The Return Of The King Man 268 2459
We are still violating one of the fundamental principles of tidy data. “Word count” is a fundamental variable in our dataset and it’s currently spread out over two variables,
Male. Conceptually, we need to gather up the word counts into a single variable and create a new variable,
Gender, to track whether each count refers to females or males. We use the
pivot_longer() function from the tidyr package to do this.
lotr_tidy <- pivot_longer(lotr_untidy, c(Male, Female), names_to = 'Gender', values_to = 'Words') lotr_tidy
## # A tibble: 18 x 4 ## Film Race Gender Words ## <chr> <chr> <chr> <dbl> ## 1 The Fellowship Of The Ring Elf Male 971 ## 2 The Fellowship Of The Ring Elf Female 1229 ## 3 The Fellowship Of The Ring Hobbit Male 3644 ## 4 The Fellowship Of The Ring Hobbit Female 14 ## 5 The Fellowship Of The Ring Man Male 1995 ## 6 The Fellowship Of The Ring Man Female 0 ## 7 The Two Towers Elf Male 513 ## 8 The Two Towers Elf Female 331 ## 9 The Two Towers Hobbit Male 2463 ## 10 The Two Towers Hobbit Female 0 ## 11 The Two Towers Man Male 3589 ## 12 The Two Towers Man Female 401 ## 13 The Return Of The King Elf Male 510 ## 14 The Return Of The King Elf Female 183 ## 15 The Return Of The King Hobbit Male 2673 ## 16 The Return Of The King Hobbit Female 2 ## 17 The Return Of The King Man Male 2459 ## 18 The Return Of The King Man Female 268
Tidy data… mission accomplished!
To explain our call to
pivot_longer() above, let’s read it from right to left: we took the variables
Male and gathered their values into a single new variable
Words. This forced the creation of a companion variable
Gender, which tells whether a specific value of
Words came from Female or Male. All other variables, such as
Film, remain unchanged and are simply replicated as needed.
Now we write this multi-film, tidy dataset to file for use in various downstream scripts for further analysis and visualization.
write_csv(lotr_tidy, path = "../datasets/lotr_tidy.csv")
## Warning: The `path` argument of `write_csv()` is deprecated as of readr 1.4.0. ## Please use the `file` argument instead.
After tidying the data and completing your analysis, you may want to output a table that has each race in its own column. Let’s use the
pivot_wider() function to make such a table and save it as “lotr_wide”
OPTIONAL: Use the pivot_longer() function to transform you lotr_wide back to tidy format.
# let's get one variable per Race lotr_tidy %>% pivot_wider(names_from = Race, values_from = Words)
## # A tibble: 6 x 5 ## Film Gender Elf Hobbit Man ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 The Fellowship Of The Ring Male 971 3644 1995 ## 2 The Fellowship Of The Ring Female 1229 14 0 ## 3 The Two Towers Male 513 2463 3589 ## 4 The Two Towers Female 331 0 401 ## 5 The Return Of The King Male 510 2673 2459 ## 6 The Return Of The King Female 183 2 268
# let's get one variable per Gender lotr_tidy %>% pivot_wider(names_from = Gender, values_from = Words)
## # A tibble: 9 x 4 ## Film Race Male Female ## <chr> <chr> <dbl> <dbl> ## 1 The Fellowship Of The Ring Elf 971 1229 ## 2 The Fellowship Of The Ring Hobbit 3644 14 ## 3 The Fellowship Of The Ring Man 1995 0 ## 4 The Two Towers Elf 513 331 ## 5 The Two Towers Hobbit 2463 0 ## 6 The Two Towers Man 3589 401 ## 7 The Return Of The King Elf 510 183 ## 8 The Return Of The King Hobbit 2673 2 ## 9 The Return Of The King Man 2459 268
# let's get one variable per combo of Race and Gender lotr_tidy %>% unite(Race_Gender, Race, Gender) %>% pivot_wider(names_from = Race_Gender, values_from = Words)
## # A tibble: 3 x 7 ## Film Elf_Male Elf_Female Hobbit_Male Hobbit_Female Man_Male Man_Female ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 The Fellows… 971 1229 3644 14 1995 0 ## 2 The Two Tow… 513 331 2463 0 3589 401 ## 3 The Return … 510 183 2673 2 2459 268
The word count data is given in two untidy and gender-specific files available at these URLs:
Write an R script that reads them in and writes a single tidy data frame to file. Literally, reproduce the lotr_tidy data frame and the lotr_tidy.csv data file from above.
Write R code to compute the total number of words spoken by each race across the entire trilogy. Do it two ways:
Reflect on the process of writing this code and on the code itself. Which is easier to write? Easier to read?
Write R code to compute the total number of words spoken in each film. Do this by copying and modifying your own code for totalling words by race. Which approach is easier to modify and repurpose – the one based on multiple, untidy data frames or the tidy data?