Data Wrangling with R: A Short Course

A 4-Day Livestream Seminar Taught by Kieran Healy, Ph.D.


R is a free and open-source software environment for statistical analysis that is widely used in the social, health, physical, and computational sciences. R is powerful, flexible, and has excellent graphics capabilities. It also has a large and rapidly growing community of users.

Although there are a variety of approaches to working with data in R, in recent years, the “tidyverse” has emerged as a cohesive and consistent approach to the everyday tasks of data wrangling and analysis. The tidyverse is a suite of tools for data management, manipulation, analysis, and visualization within the R software environment for statistical computing. This seminar provides an intensive, hands-on introduction to using tidyverse tools for doing your own work.

Starting July 30, we are offering this seminar as a 4-day synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two lecture sessions which include hands-on exercises, separated by a 1-hour break. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously. 

Closed captioning is available for all live and recorded sessions. Live captions can be translated to a variety of languages including Spanish, Korean, and Italian.

More Details About the Course Content

The course is not focused on particular statistical methods or modeling techniques. Rather, we will learn how to accomplish everyday tasks that statistical analysis depends on but which are rarely taught in detail in their own right. These include topics such as getting your own data into R, exploring the structure of your data, recoding variables and reshaping tables, and presenting summary tabulations and graphs of this work.

Throughout the course we will emphasize how R and the tidyverse “think”. Every dataset is different, especially at the stage where it still needs further cleaning or arranging before it can be easily analyzed or effectively presented. This course will teach you the logic and implicit “flow of action” behind the tidyverse’s tools, giving you the ability to apply and extend this way of thinking when working with your own data and its particular challenges.

Computing

We will be working with the most recent stable versions of R and RStudio, as well as with a number of additional packages. You will need to install R, RStudio, and the necessary packages on your own computer. Basic familiarity with R is highly desirable, but even novice R coders should be able to follow the presentation and do the exercises.

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics.

Who Should Register?

You should take this course if you are interested in answering questions like these:

  • How can I properly get my data into R?
  • How should I deal with different types of data?
  • How can I explore the structure of my data?
  • How can I manipulate, summarize, and tabulate my data?
  • How can I efficiently clean my data?
  • How can I reshape or reconfigure my data?
  • How can I quickly graph or report on my data?

The course does not presume any prior experience with R. However, if you already use R and have been frustrated by questions like these:

  • How can I get these 50 CSV files into R?
  • Why can’t I get the right answer when summarizing this grouped data?
  • How can I tell R that my categorical measure is ordered?
  • How can I clean up this textual data?
  • How can I neatly calculate summary statistics for all the measures in my data?
  • How can I arrange this table to print in a nice way?
  • Why doesn’t the answer I found on Stack Overflow work properly?
  • Why does the answer I found on Stack Overflow work properly?
  • Why does R keep telling me “object of type ‘closure’ is not subsettable”?

… then this course will be worthwhile for you, too.

Outline

1. Tidy data and the tidyverse

  • Motivation: plain-text data analysis
  • How R works and why it got that way
  • What’s “tidy” about the tidyverse?
  • Pipelining your code
  • A first example

2. Getting your data into R with readr

  • Reading in a single table of data
  • Tibbles
  • Data types
  • Common pitfalls and problems

3. Tabulating and summarizing data with dplyr

  • Filtering, selecting, mutating, and summarizing a single table
  • Manipulating column names and arranging rows
  • Groups and the logic of working with grouped data
  • Calculating on the columns of a table, and on the rows
  • Zero counts in dplyr and other gotchas

4. Reshaping data with tidyr

  • Moving back and forth between wide and long data
  • Splitting, separating, and recoding observations
  • Managing and visualizing missing values
  • Expanding and completing datasets

5. Managing categorical measures and textual data with forcats and stringr

  • Working with factors in R and in the tidyverse
  • Recoding and re-leveling factors
  • String manipulation, regular expressions, and stringr

6. Iterating on data with dplyr and purrr

  • Relational data in dplyr
  • Joining tables
  • Working across() columns
  • Using map() and its friends to feed your data to functions

7. Modeling with broom

  • Extending tidy principles to models
  • Fitting and summarizing model output

8. Making it easier to be tidy

  • The janitor package helps clean your data
  • Working with the usethis and reprex helper packages

9. Managing your clean data

  • Documenting your data
  • Using a package to store your data
  • The wider world of tidyverse-friendly packages and tools

Reviews of Data Wrangling with R

“This course was extremely well presented and organized. The professor did a great job presenting the material in a way that was both digestible and practical. He was also great about answering all questions. This was a fantastic course that I got a lot out of and am highly recommending to my colleagues and our agency director.”
  Lindsay Bostwick, DOJ/OJP/Bureau of Justice Statistics

“I appreciated that the emphasis was, as indicated, on handling and cleaning data and making tables and the like. I especially liked the “never copy-paste; you have a computer to do so” sentiment. I am rather handy at Stata, I teach it, but the switch to R is not so easy and quick to make. This was exactly the course I was looking for the past two years! Kieran Healy is a perfect teacher. He knows what students need and what researchers do very often, and the excellent documentation will do the rest. This was a perfect course!”
  Yolanda Grift, Utrecht University

“I liked that the materials provided (code, slides, etc.) made it easy to follow along.”
  Kris Andersen, Oslo University Hospital

“I found this workshop extremely useful for several reasons. Dr. Healy is a very knowledgeable, engaged instructor: approachable, responsive, and enjoys teaching. He brings his research and real-life examples to the classroom setting. This is my second workshop with him. I’ve taken nearly a dozen workshops delivered by Statistical Horizons and he is one of the best instructors.”
  Towhid Islam, University of Guelph

“I liked the way R was introduced as a new but familiar language. Since I had previously used R in school, it was a good refresher. The course was a good start to diving back into the R syntax.”
  Rika Alavi, Kaiser Permanente Northern California

“The live examples were incredibly useful, as was the way Kieran Healy baked into his presentation the philosophy of tidy coding with concrete examples from the real world. The course material availability was sterling. The recorded sessions have been very handy for rehashing some of the material between sessions and after the course.”
  Jessica Louise Ray, NORCE Norwegian Research Centre AS

“The course is what I was exactly looking for, knitting the documents and writing the code. Super helpful for my current work.”
  Sneha Yeddala, OHSU/Center for Evidence-based Policy

“This is an excellent course. I really appreciate that Kieran assumed no previous experience with R. Also, as a beginner, I really appreciated the workshops during the first day, which focused not just on syntax, but also on how R works conceptually and how it understands user input. I have already recommended to my supervisor that other data analysts in our organization attend these workshops.”
  Catlin Nchako, Center on Budget and Policy Priorities

Seminar Information

Tuesday, July 30 – Friday, August 2, 2024

Schedule: All sessions are held live via Zoom. All times are ET (New York time).

10:30am-12:30pm
1:30pm-3:00pm

Payment Information

The fee of $995 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.