NTRES 6100
Collaborative and Reproducible Data Science in R
Cornell University, Fall 2024
Lectures: Tuesdays and Thursdays 10:10am - 11:25am (August 27 - November 5, 2024), Morrison Hall 163
Optional lab sessions: Thursdays or Fridays 12:20pm - 2:15pm, Kennedy Hall 101
Instructor: Associate Professor Nina Overgaard Therkildsen (nt246@cornell.edu)
TA: PhD Student Jaime Ortiz Pachar (jdo53@cornell.edu)
Office hours: Nina: by appointment; Jaime: Wednesdays, 10 - 11 am in Fernow 311 or by appointment
Grading: S/U (2 credits / 3 credits with lab)
As datasets grow larger and more complex across all areas of science, computational skills are in increasingly high demand. This course introduces a series of practical tools that enable researchers to spend less time wrestling with software or repeating error-prone manual data processing and more time getting research done in efficient and transparent ways that facilitate collaboration and reproducibility. We will work in R/RStudio, primarily with the tidyverse packages and with Git and GitHub integration. Topics covered include 1) tidy data formatting, 2) rearrangement, filtering, exploration, and visualization of complex datasets, 3) basic programming for building and automating custom tools, 4) tracking the history of file changes (version control) with Git and GitHub, 5) strategies for effective collaboration on data processing pipelines, and 6) using R Markdown to combine text, equations, code, tables, and figures into reports, websites, and presentations. The course emphasizes practical skill development and will be structured around hands-on (the keyboard) learning.
By the end of this course, students will be able to:
A basic working knowledge of R will be helpful, but no prior experience with the tidyverse packages or with Git, GitHub, or R Markdown is assumed. If you have never worked in R before, we recommend working through one or more of the following tutorials before the course:
The two weekly lectures will introduce new concepts and provide opportunities to practice through hands-on exercises. To participate effectively, you must have completed the assigned readings prior to class. Each Thursday, we will assign a problem set that applies the concepts covered in class in a new context to reinforce your learning. The problem sets are due the following Thursday at 10pm. We offer two optional lab sessions on Thursdays and Fridays for more opportunities to practice in groups and with TA support; the Thursday and Friday sessions are identical and you can attend either one of them.
It takes practice to acquire and internalize data science skills, and what you get out of this course will be proportional to the effort you put in. Practice as much as you can. To pass, you are expected to attend all lectures (and lab sessions if enrolled), participate actively during class, submit at least 7 of the 9 problem sets with demonstrated effort to complete all questions, and give a brief (~2 minute) presentation at the end of the course about how you might adopt some of the course material in your own work. If you are unable to attend a lecture or cannot meet a problem set deadline, please let the instructor and TA know on Slack beforehand. If you are registered in one of the lab sessions (one extra credit), you are also expected to participate in lab activities in at least 7 of the 9 lab sessions.
All assigned readings are freely available online and will be linked from the course website. We will draw from a variety of sources, primarily Wickham, Çetinkaya-Rundel, and Grolemund’s R for Data Science and the STAT545 course developed by Jenny Bryan.
All students will need to bring a laptop to each session. Students who do not have their own laptop can arrange to borrow one from the Mann Library.
Please follow the instructions here to install the software we will need prior to the first class.
We are dedicated to providing a welcoming and supportive environment for everyone, regardless of background, identity and prior experience level. Everyone in this course will be coming from a different place with different experiences and expectations. We will not tolerate any form of language or behavior used to exclude, intimidate, or cause discomfort. This applies to all course participants (instructor, students, guests). In order to foster a positive and professional learning environment, we encourage the following kinds of behaviors:
In compliance with Cornell University policy and equal access laws, we are available to discuss appropriate academic accommodations that may be required for students with disabilities. Requests for academic accommodations are to be made during the first two weeks of the course, except for unusual circumstances, so arrangements can be made. Students are encouraged to register with Student Disability Services to verify their eligibility for appropriate accommodations.
This semester, the course is offered fully in-person and we expect you to show up for class. However, in an effort to accommodate special needs and keep everyone safe and healthy, we will also provide a Zoom link for joining lectures online when you are not able to participate in-person. You can use the Zoom link posted on the course Canvas site instead of showing up in person if you are sick or have another reasonable justification. However, we are not able to accommodate fully hybrid participation, and we strongly encourage you to participate in-person whenever possible. Online participation will not be possible for the lab sessions.
Subject to adjustment
Lab# | Date (Thu) | Date (Fri) | Topic |
---|---|---|---|
1 | 8/29 | 8/30 | RMarkdown |
2 | 9/5 | 9/6 | RMarkdown and GitHub |
3 | 9/12 | 9/13 | Displaying data visualization on a website |
4 | 9/19 | 9/20 | Data exploration with the gapminder dataset |
5 | 9/26 | 9/27 | Data exploration with the Titanic dataset |
6 | 10/3 | 10/4 | Data cleaning and tidy data |
7 | 10/10 | 10/11 | Relational data and tidy data |
8 | 10/17 | 10/18 | Iteration and conditional execution |
9 | 10/24 | 10/25 | Functions and iterations |
10 | 10/31 | 11/1 | OPTIONAL: Bring your own project |