Data Cleaning

A 3-Day Remote Seminar Taught by Bianca Manago, Ph.D.

The management and cleaning of data is essential to the integrity of research findings. Unfortunately, there is little formal focus on how to approach data cleaning. For example, when someone says “you need to clean the data,” what exactly does that mean? What steps are involved, and in what order? How do we decide what needs to be done?

Data cleaning involves all the steps that occur between data collection and analysis (e.g., merging, appending, labeling, data analytics, cross-validation, constructing/re-constructing variables for analysis, identifying missing data). This seminar will cover all these topics.

This seminar will also provide a more general framework for approaching data cleaning, one rooted in a desire to represent data both accurately and fully. The framework informs decisions about the order in which data cleaning should be conducted, and it addresses some of the trickier issues. For example, when you come across anomalous, vague, or missing data, what should you consider? I will also provide guidance for ensuring that, once you make a decision, your findings are reproducible.

Starting December 9, we are offering this seminar as a 3-day synchronous*, remote workshop for the first time. Each day will consist of a 4-hour live lecture held via the free video-conferencing software Zoom. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later if you are unable to attend at the scheduled time.

Each lecture session will conclude with a hands-on exercise reviewing the content covered, to be completed on your own. Additional lab sessions will be held on Thursday and Friday afternoons, where you can review the exercise results with the instructor and ask questions.

*We understand that scheduling is difficult during this unpredictable time. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.

Closed captioning is available for all live and recorded sessions.


The empirical examples and exercises in this course will emphasize Stata, but equivalent code and examples will be available for R. To fully benefit from the course, you should bring your own laptop loaded with R or Stata. Whichever package you choose, you should already have a working understanding of the software and be able to perform basic tasks in it.

Seminar participants who are not yet ready to purchase Stata can take advantage of StataCorp’s free 30-day evaluation offer or their 30-day software return policy.

If you’d like to use R for this course but don’t yet have much experience with that package, here are some excellent on-line resources for building your R skills.

Who Should Register?

This course is for anyone who works with data and wants to improve their data management practices. It will be useful for graduate students and junior scholars, and it will also offer insights to seasoned researchers.


Day 1: Getting Started with Data Cleaning

Introduction to data cleaning

  • Types of data
  • Steps of data cleaning
  • Documentation

Compiling data

  • Downloading/collecting
  • Merging
  • Appending
  • Reshaping
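
To give a feel for two of the steps listed above, here is a minimal sketch of appending and merging. It is written in Python with only the standard library purely for illustration; the seminar itself works in Stata and R, and every name and value below (the tables, `append_rows`, `merge_left`, the `matched` flag) is hypothetical rather than taken from the course materials.

```python
# Hypothetical survey records; all names and values here are illustrative.
wave1 = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
wave2 = [{"id": 3, "age": 28}]
income = [{"id": 1, "income": 42000}, {"id": 3, "income": 38500}]


def append_rows(*tables):
    """Appending: stack the rows of tables that hold the same variables."""
    return [dict(row) for table in tables for row in table]


def merge_left(left, right, key):
    """Merging: attach the right table's variables to left rows with the same key.

    The right table must have unique key values; unmatched left rows are kept.
    """
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key!r} value in right table: {row[key]!r}")
        lookup[row[key]] = row
    merged = []
    for row in left:
        match = lookup.get(row[key])
        combined = {**row, **(match or {})}
        combined["matched"] = match is not None  # record whether the merge succeeded
        merged.append(combined)
    return merged


people = append_rows(wave1, wave2)           # three rows: ids 1, 2, 3
merged = merge_left(people, income, "id")    # id 2 has no income record
for row in merged:
    print(row)
```

Keeping an explicit `matched` flag is one way to check afterward which cases failed to merge, rather than discovering the gaps during analysis.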

Labeling and naming variables

Day 2: Examining Data Quality

Examining data

  • Detecting anomalous data
    • Univariate statistics/graphs
    • Bivariate statistics/graphs
  • Detecting missing data
  • Creating indicator variables
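
As a rough illustration of detecting missing data and creating an indicator variable, the sketch below counts missing values and adds a 0/1 flag. It is Python with only the standard library, not the Stata or R code used in the seminar. The set of missing-value codes (including the sentinel `-99`), the variable names, and the helper functions are all assumptions made for this example.

```python
# Assumption: these values stand for "missing" in the hypothetical data below.
MISSING_CODES = {None, "", -99}

rows = [
    {"id": 1, "income": 42000},
    {"id": 2, "income": -99},   # sentinel code entered at data collection
    {"id": 3, "income": None},  # value never recorded
]


def is_missing(value):
    """Return True if the value is one of the assumed missing-value codes."""
    return value in MISSING_CODES


def count_missing(rows, var):
    """Count how many rows have a missing value for the given variable."""
    return sum(is_missing(row.get(var)) for row in rows)


def add_missing_indicator(rows, var):
    """Return copies of the rows with a new 0/1 variable named '<var>_missing'."""
    return [{**row, f"{var}_missing": int(is_missing(row.get(var)))} for row in rows]


n_missing = count_missing(rows, "income")
flagged = add_missing_indicator(rows, "income")
print(n_missing)
for row in flagged:
    print(row)
```

The original values are left untouched and the flag is stored in a new variable, so the decision about how to treat missing cases stays visible and reversible.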

Data cleaning dilemmas

Altering variables

  • Recoding variables
  • Transforming variables

Day 3: Recoding & Creating Variables

New variables

  • Scale construction
  • Substantive variable combination
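
One common form of scale construction is averaging several survey items after reverse-coding the negatively worded ones. The sketch below shows that idea in Python with only the standard library; it is not the seminar's Stata or R code, and the 1 to 5 response range, the item names, and the rule requiring a minimum number of valid items are assumptions made for illustration.

```python
def reverse_code(value, low=1, high=5):
    """Flip a response on a low..high scale (for example, 2 becomes 4 on a 1-5 scale)."""
    return low + high - value


def scale_mean(items, min_valid):
    """Average the non-missing items; return None if too few items are valid."""
    valid = [v for v in items if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical respondent: q2 is negatively worded, q3 was not answered.
respondent = {"q1": 4, "q2": 2, "q3": None}
items = [respondent["q1"], reverse_code(respondent["q2"]), respondent["q3"]]
score = scale_mean(items, min_valid=2)
print(score)
```

Requiring a minimum number of valid items is only one possible rule for handling partial responses; whichever rule is chosen, writing it into the code documents the decision.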

Re-configuring data

Re-examining data

Documentation and presentation

Summary and conclusion

Seminar information

Thursday, December 9, 2021 – Saturday, December 11, 2021

Each day will follow this schedule:

10:00am-2:00pm ET: Live lecture via Zoom

4:00pm-5:00pm ET: Live lab session via Zoom (Thursday and Friday only)

Payment Information

The fee of $895 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.