GitHub for Data Analysis

A 3-Day Remote Seminar Taught by Aaron Gullickson, Ph.D.

 

Git is a free, open-source distributed version control system that programmers and data analysts use to track project progress efficiently, code without fear of error, and collaborate sanely. Although version control was originally developed for software development, data scientists have adopted it to manage projects efficiently and to share research materials (such as code) with broader communities.

GitHub, a website that hosts git repositories online, has emerged as a leading choice for data analysts and researchers seeking to collaborate and share projects using git. GitHub provides a variety of additional features and workflows that improve the experience of using git.

This seminar will familiarize you with using git through GitHub and demonstrate how to integrate GitHub into a research workflow. We will teach you the basic git workflow and how to use git and GitHub to simplify research collaboration.

Starting September 30, we are offering this seminar as a 3-day synchronous*, remote workshop for the first time. Each day will consist of a 4-hour live lecture held via the free video-conferencing software Zoom. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later if you are unable to attend at the scheduled time.

Each lecture session will conclude with a hands-on exercise reviewing the content covered, to be completed on your own. Additional lab sessions will be held on Thursday and Friday afternoons, where you can review the exercise results with the instructor and ask any questions.

*We understand that scheduling is difficult during this unpredictable time. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.

Closed captioning is available for all live and recorded sessions.

More Details About the Course Content

The course will introduce you to the basic git workflow, including how to commit, push, and pull changes to your research materials and how to create and clone repositories through GitHub. You will also learn how to create separate branches of code for saner collaboration and how to merge branches using GitHub pull requests.
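
As a rough illustration of the workflow described above, the Python sketch below lists the git commands for one cycle of cloning, branching, committing, pushing, and pulling. The repository URL, folder, file name, commit message, and branch names are hypothetical placeholders rather than course materials, and the default branch is assumed to be called main. The script only builds and prints the commands; it does not run them.

```python
import shlex


def basic_workflow(remote_url, folder, filename, message,
                   branch="analysis-update", default_branch="main"):
    """Return, in order, the git commands for one clone-to-pull cycle.

    Every argument is a hypothetical placeholder; nothing is executed.
    """
    git = ["git", "-C", folder]  # run each later command inside the clone
    return [
        ["git", "clone", remote_url, folder],    # copy the GitHub repository locally
        git + ["switch", "-c", branch],          # create and move to a new branch
        git + ["add", filename],                 # stage the edited file
        git + ["commit", "-m", message],         # record the change locally
        git + ["push", "-u", "origin", branch],  # publish the branch to GitHub
        git + ["switch", default_branch],        # return to the default branch
        git + ["pull"],                          # fetch and merge collaborators' work
    ]


if __name__ == "__main__":
    commands = basic_workflow(
        remote_url="https://github.com/example-user/example-project.git",  # placeholder
        folder="example-project",
        filename="analysis.R",
        message="Add first regression model",
    )
    for command in commands:
        print(shlex.join(command))
```

No merge command appears in the list because the pushed branch would be merged through a pull request on GitHub. The `git switch` command assumes git 2.23 or later; older installations would use `git checkout -b` instead.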

You will learn both command line tools for working with git and several tools for using git through a graphical user interface (GUI). We will use RStudio to demonstrate how to use git to manage a project, but the principles of using git apply broadly to any statistical software package that uses scripting.

The seminar will be very hands-on, and you will learn how to create and manage your own remote repositories through GitHub. You are welcome to bring projects to the course for which you would like to construct GitHub repositories.

Computing

To participate in the hands-on exercises and follow along in class, you will need to have git installed on your computer. Git is free, open source, and available on Windows, Mac, and Linux platforms. Windows users will also need to use the Git Bash application (installed automatically with git) for command line operations. You will also need to create a free account on GitHub.

We will make use of additional GUI clients that can make it easier to work with git. You are encouraged to download and install R and RStudio, the Atom text editor, and GitKraken. All of these applications are free and available on Windows, Mac, and Linux platforms.

Who Should Register?

This course is for anyone who wants to improve their statistical research workflow and learn to easily collaborate on research and share the products of that research. The principles learned in this course can be applied broadly to working in any statistical or coding environment.

Outline

Day 1: The Basic Git Workflow

  • What is version control and why should you use it?
  • Setting up git via the command line
  • Time to Commit: Working with a local repository
  • Push and Pull: Working with a remote repository
  • Using Git GUI Clients
    • Setting up an R Project in RStudio
    • Using Atom to write and git
    • Git using GitKraken
  • Activity: Setting up your first repository on GitHub

Day 2: Collaborating with Others

  • Pushing and pulling with collaborators
  • Resolving (git) conflicts
  • Branching for sanity
  • The GitHub pull request
  • Activity: Create a branch and pull request

Day 3: Dealing with Complications

  • Ignorance is bliss: the .gitignore file
  • Going back in time: reverting your work
  • Forking from GitHub
  • Working with large files
  • Writing papers with git
  • Activity: Fork an interesting GitHub repository

Seminar Information

Thursday, September 30, 2021 – Saturday, October 2, 2021

Each day will follow this schedule:

10:00am-2:00pm ET: Live lecture via Zoom

4:00pm-5:00pm ET: Live lab session via Zoom (Thursday and Friday only)

Payment Information

The fee of $895 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.