A 3-Day Remote Seminar Taught by Bianca Manago, Ph.D.
The management and cleaning of data are essential to the integrity of research findings. Unfortunately, there is little formal guidance on how to approach data cleaning. For example, when someone says “you need to clean the data,” what exactly does that mean? What steps are involved, and in what order? How do we decide what needs to be done?
Data cleaning involves all the steps that occur between data collection and analysis (e.g., merging, appending, labeling, data analytics, cross-validation, constructing/re-constructing variables for analysis, identifying missing data). This seminar will cover all these topics.
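To make a few of these steps concrete, here is a minimal sketch of appending, merging, recoding a missing-value code, and creating a missing-data indicator. It is written in plain Python purely for illustration; the seminar itself uses Stata and R. The records, the variable names, the `clean` helper, and the use of -9 as a missing-value code are all hypothetical assumptions, not course materials.

```python
# Hypothetical survey records; every name and value here is invented for illustration.
wave1 = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": -9, "income": 48000},  # -9 is an assumed "missing" code
]
wave1_late = [
    {"id": 3, "age": 51, "income": None},
]
region = {1: "Midwest", 2: "South"}  # lookup keyed by id; id 3 has no match

MISSING_CODES = {-9}  # assumption: the codebook marks missing values with -9


def clean(records, lookup):
    """Return cleaned copies of the records, leaving the raw data untouched."""
    cleaned = []
    for row in records:
        out = dict(row)  # copy so the original record is never modified
        # Merge: attach region by id (None when the id has no match)
        out["region"] = lookup.get(out["id"])
        # Recode sentinel missing-value codes to None
        for var in ("age", "income"):
            if out[var] in MISSING_CODES:
                out[var] = None
        # Indicator variable: is any field missing for this record?
        out["any_missing"] = any(
            out[v] is None for v in ("age", "income", "region")
        )
        cleaned.append(out)
    return cleaned


# Append: stack the late responses under the main file, then clean
combined = wave1 + wave1_late
cleaned = clean(combined, region)

for row in cleaned:
    print(row)
```

Working on copies rather than overwriting the raw records is one simple way to keep each cleaning decision traceable and the results reproducible.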
This seminar will also provide a more general framework for approaching data cleaning, one rooted in a desire to represent data both accurately and fully. This framework informs decisions about the ideal order in which to conduct data cleaning. It also addresses some of the trickier issues. For example, when you come across anomalous, vague, or missing data, what kinds of things should you consider? I will also provide guidance for ensuring that, once you make a decision, your findings are reproducible.
Starting December 9, we are offering this seminar as a 3-day synchronous*, remote workshop for the first time. Each day will consist of a 4-hour live lecture held via the free video-conferencing software Zoom. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later if you are unable to attend at the scheduled time.
Each lecture session will conclude with a hands-on exercise reviewing the content covered, to be completed on your own. An additional lab session will be held Thursday and Friday afternoons, where you can review the exercise results with the instructor and ask any questions.
*We understand that scheduling is difficult during this unpredictable time. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.
Closed captioning is available for all live and recorded sessions.
The empirical examples and exercises in this course will emphasize Stata, but equivalent code and examples will be available for R. To fully benefit from the course, you should bring your own laptop loaded with R or Stata. Whichever package you choose, you should already have a working understanding of it and be able to perform basic tasks in it.
If you’d like to use R for this course but don’t yet have much experience with that package, here are some excellent on-line resources for building your R skills.
Who Should Register?
Day 1: Getting Started with Data Cleaning
Introduction to data cleaning
- Types of data
- Steps of data cleaning
Labeling and naming variables
Day 2: Examining Data Quality
- Detecting anomalous data
- Univariate statistics/graphs
- Bivariate statistics/graphs
- Detecting missing data
- Creating indicator variables
Data cleaning dilemmas
- Recoding variables
- Transforming variables
Day 3: Recoding & Creating Variables
- Scale construction
- Substantive variable combination
Documentation and presentation
Summary and conclusion
Thursday, December 9, 2021 – Saturday, December 11, 2021
Each day will follow this schedule:
10:00am-2:00pm ET: Live lecture via Zoom
4:00pm-5:00pm ET: Live lab session via Zoom (Thursday and Friday only)
The fee of $895 includes all course materials.
PayPal and all major credit cards are accepted.
Our Tax ID number is 26-4576270.