Workflow of Data Analysis Using R: A Short Course

A 3-Day Livestream Seminar Taught by Bianca Manago, Ph.D.

Statistical analyses are only as good as the data that go into them. This is why most of the time on any data analysis project should be spent not on conducting the analyses (i.e., actually running the model) but on the steps needed to prepare the data for analysis. Dozens of decisions go into data management. If not carefully considered and documented, those decisions can produce erroneous results or preclude replication.

This seminar is designed to teach researchers how to prepare data for analysis in a way that is both accurate and replicable. By following the principles taught here, your data analytic projects will be both well planned and well executed. The scope of the seminar ranges from broad topics such as developing research plans to the minutiae of planning variable names.

Starting April 7, we are offering this seminar as a 3-day synchronous*, livestream workshop. Each day will consist of a 4-hour live lecture held via the free video-conferencing software Zoom. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

Each day will include a hands-on exercise to be completed on your own after the lecture session is over. Additional lab sessions will be held on Thursday and Friday afternoons, where you can review the exercise results with the instructor and ask any questions.

*We understand that scheduling is difficult during this unpredictable time. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.

Closed captioning is available for all live and recorded sessions.

More Details About the Course Content

This seminar is for researchers who are trying to establish or improve their workflow. I do not expect participants to be expert programmers; this seminar should be accessible to novice R users, while still being useful to more advanced users. Lessons from this seminar balance ease of use with proper functioning, introducing researchers to useful tools such as dual-pane browsers, macro programs, plain text editors, RStudio, and GitHub. For those who are already familiar with these tools, this seminar will teach you how to optimize them. Lessons from this seminar should make conducting research less painful, more efficient, more accurate, and more reproducible.

This is a hands-on seminar with ample opportunities to plan and practice your workflow.

Some highlights include:

  • Planning (analyses, sensitivity analyses, variable construction, etc.)
  • Organizing files using a standardized directory structure
  • Preserving data and findings
  • Effectively documenting findings, data sources, and cleaning methods
  • Separating data management and analyses using a dual workflow
  • Writing robust script files
  • Naming variables
  • Labeling variables and values
  • Creating research that is both reproducible and replicable
  • Examining data quality

Computing

The empirical examples and exercises in this course will emphasize R. To fully benefit from the course, you should use your own computer with R installed. You should also download and install RStudio, a front-end for R that makes it easier to work with. Both programs are free and available for Windows, Mac, and Linux platforms. For those who prefer Stata, equivalent Stata code will be provided on request.

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.

Who Should Register?

This course is for anyone who wants to improve the efficiency and accuracy of their data analysis and presentation. You should have experience with data analysis, as well as familiarity with the R programming language.

Outline

PART 1: INTRODUCTION TO WORKFLOW

  1. What is “workflow” (WF)?
  2. Why care about WF?
  3. WF and replication
  4. Steps in and principles of WF

PART 2: PLAN, ORGANIZE, DOCUMENT, AND PRESERVE

  1. Planning research projects in the:
    a. Large (overall questions, project checklist, and timeline)
    b. Middle (data cleaning, analyses, tables, and figures)
    c. Small (naming variables, naming files, value labels, and order of analyses/cleaning)
  2. Organizing files and folders
  3. Documentation
  4. Preserving data and preventing loss
  5. Replication

PART 3: SCRIPT FILES IN R

  1. Strengths and weaknesses of R for workflow
  2. Dual workflow
  3. Robust script files
  4. Legible script files
  5. Automation in script files

PART 4: CLEANING, LABELING, & MISSING DATA

  1. Naming and labeling variables
  2. Missing data
  3. Merging data
  4. Verifying data

PART 5: ANALYZING & PRESENTING FINDINGS

  1. Principles of data analysis
  2. Documenting provenance
  3. The posting principle
  4. Presenting findings

PART 6: COLLABORATION

  1. Key factors in collaboration
  2. Introducing workflow with co-authors
  3. Coordinating workflow with multiple authors

Reviews of Workflow of Data Analysis Using R

“This should be required learning for all researchers. The course covers the principles/rationale for having good workflow as well as concrete steps you can do right away to improve your own workflow. I was able to immediately implement steps that I know will make me a better researcher. I also have the resources I need to tackle larger changes to my workflow in the coming weeks. It’s been 5 years since I completed my PhD and I wish I had taken this course in my first year. I left this course feeling excited and empowered.”
     Marta Mulawa, Duke University

“Great class! Bianca Manago is a superb teacher; positive, enthusiastic, extremely knowledgeable, clear, and responsive and helpful to students. I gained extremely valuable insights, principles, tips, and practical strategies that will improve the quality, reproducibility, and replicability of all my research.”
     Ken Coburn, Health Quality Partners (HQP)

“Bianca was a great, knowledgeable, entertaining teacher. She really made what could have been very dry material seem exciting. The knowledge I gained in this course will revolutionize the way I approach my research projects.”
     Laura Prichett, Johns Hopkins

“This was a very informative seminar. I would suggest it for all data managers, analysts, and study team members. There are many practices suggested that would be beneficial to a study team.”
     Angela Green, Johns Hopkins University

Seminar Information

Thursday, April 7, 2022 – Saturday, April 9, 2022

Each day will follow this schedule:

10:00am-2:00pm ET: Live lecture via Zoom

4:00pm-5:00pm ET: Live lab session via Zoom (Thursday and Friday only)

Payment Information

The fee of $895 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.