Data Cleaning: A Short Course

A 4-Day Livestream Seminar Taught by Bianca Manago, Ph.D.

The management and cleaning of data is essential to the integrity of research findings. Unfortunately, there is little formal focus on how to approach data cleaning. For example, when someone says “you need to clean the data,” what exactly does that mean? What steps are involved, and in what order? How do we decide what needs to be done?

Data cleaning involves all the steps that occur between data collection and analysis (e.g., merging, appending, labeling, data analytics, cross-validation, constructing/re-constructing variables for analysis, identifying missing data). This seminar will cover all these topics.
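
As a rough illustration of three of these steps (appending, merging, and identifying missing data), here is a minimal sketch in plain Python. The seminar itself works in Stata and R; the records, the field names (`id`, `age`, `income`), and the helper functions below are hypothetical and chosen only for illustration.

```python
# Hypothetical survey records; all names and values are invented for illustration.
wave1 = [{"id": 1, "age": 34}, {"id": 2, "age": None}]
wave2 = [{"id": 3, "age": 51}]
income = [{"id": 1, "income": 42000}, {"id": 3, "income": None}]


def append_rows(*tables):
    """Append: stack tables that share the same columns."""
    return [dict(row) for table in tables for row in table]


def merge_on_key(left, right, key):
    """Merge (left join): add columns from `right` to matching rows of `left`."""
    lookup = {row[key]: row for row in right}
    right_cols = {col for row in right for col in row if col != key}
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        new_row = dict(row)
        for col in right_cols:
            new_row[col] = match.get(col)  # None marks a missing value
        merged.append(new_row)
    return merged


def count_missing(rows):
    """Identify missing data: count None values per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (value is None)
    return counts


people = append_rows(wave1, wave2)          # append the two waves
data = merge_on_key(people, income, "id")   # merge in the income file
missing = count_missing(data)               # tally missing values per column
print(missing)
```

Note that this simple left join cannot tell a respondent who was absent from the income file apart from one whose income was recorded as missing; a fuller workflow would keep track of which rows matched.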

This seminar will also provide a more general framework for approaching data cleaning which is rooted in a desire to represent data both accurately and fully. This framework informs decisions about an ideal order in which data cleaning should be conducted. This framework also delves into some of the trickier issues. For example, when you come across anomalous, vague, or missing data – what kinds of things should you consider? I will also provide guidance for ensuring that, once you make a decision, your findings are reproducible.

Starting June 13, we are offering this seminar as a 4-day synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two lecture sessions which include hands-on exercises, separated by a 1-hour break. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously. 

Closed captioning is available for all live and recorded sessions. Live captions can be translated to a variety of languages including Spanish, Korean, and Italian. For more information, click here.

More Details About the Course Content

After taking this seminar, you will have the skills to:

  • Clean different types of data step-by-step.
  • Compile data.
  • Label and name variables.
  • Examine data.
  • Solve data cleaning dilemmas.
  • Alter variables.
  • Incorporate new variables.
  • Re-configure data.
  • Re-examine data.
  • Identify missing data.*
  • Document and present your data.

*While this seminar will help you to identify missing data, it does not spend considerable time on how to address missing data. If you’d like to learn more about methods for handling missing data, check out Dr. Paul Allison’s Missing Data seminar.

Computing

The empirical examples and exercises in this course will emphasize Stata, but equivalent code and examples will be available for R. To fully benefit from the course, you should have access to a computer loaded with R or Stata. Whichever package you choose, you should already have a working understanding of the software and be able to perform basic tasks in it.

If you’d like to familiarize yourself with Stata basics before the seminar begins, we recommend following along with a “getting started” video like the one here.

Seminar participants who are not yet ready to purchase Stata can take advantage of StataCorp’s free 30-day evaluation offer or their 30-day software return policy.

If you’d like to use R for this course but don’t yet have much experience with that package, here are some excellent online resources for building your R skills.

Who Should Register?

This course is for anyone who works with data and wants to improve their data management practices. It will be useful for graduate students and junior scholars, and will also provide insights for seasoned researchers.

Outline

Day 1: Getting Started with Data Cleaning

  • Introduction to data cleaning
    • Types of data
    • Steps of data cleaning
    • Documentation
  • Compiling data
    • Downloading/collecting
    • Merging
    • Appending
    • Reshaping

Day 2: Examining Data Quality

  • Labeling and naming variables
    • Best practices for variable names and labels
  • Examining data
    • Detecting anomalous data
      • Univariate statistics/graphs
      • Bivariate statistics/graphs
    • Detecting missing data
    • Creating indicator variables

Day 3: Recoding & Creating Variables

  • Examining data, continued
    • Data cleaning dilemmas
  • Altering variables
    • Recoding variables
    • Transforming variables
  • New variables
    • Scale construction
    • Substantive variable combination

Day 4: Final Steps for Preparing Data for Analysis

  • Re-configuring data
  • Re-examining data
  • Documentation and presentation of data cleaning
  • Summary and conclusion

Reviews of Data Cleaning

“Dr. Manago is very knowledgeable and presented the course in a way that made sense. We often learn statistics in pieces, so it was refreshing for her to present in the order that we would need to follow in the data cleaning process. The instructor provided so much detail about parts of the data cleaning process I had never learned before. It was refreshing, fun, and full of material to guide me in the data cleaning process.”
  Sara Bryson, East Carolina University 

“The tips for organizing the cleaning steps were great, as were the applied examples.”
  Monique Ernst, National Institutes of Health

“As someone who does not have much experience with R, I was surprised that I wasn’t lost or confused during the lecture. Dr. Manago gave a lot of useful information, some of which might not be valuable to me right now in this early stage of my career but I’m sure will be more valuable later in my career. I will certainly be looking back on these slides for guidance during future projects.”
  Mennefer Blue, Rush University Medical Center

Seminar Information

Tuesday, June 13 – Friday, June 16, 2023

Daily Schedule: All sessions are held live via Zoom. All times are ET (New York time).

10:30am-12:30pm (convert to your local time)
1:30pm-3:00pm

Payment Information

The fee of $995 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.