Data Analysis With ChatGPT: Promise and Pitfalls: A Short Course

An 8-Hour Livestream Seminar Taught by Stephen Vaisey, Ph.D.

One of ChatGPT’s newest abilities is its Advanced Data Analysis mode. This allows users to work interactively with ChatGPT to analyze their own data in Python without knowing any Python code themselves. In this mode, users can upload their own data and ask ChatGPT to do tasks like:

  • Describe a dataset.
  • Get descriptive statistics.
  • Perform data cleaning and data manipulation.
  • Estimate statistical models.
  • Create data visualizations.

This sounds impressive, and it is! However, users need to understand that this is like having a tireless research assistant who knows Python but does not know anything about you, your research, or the conventions of your particular field. This can lead to serious problems, misunderstandings, and errors if users aren’t careful and clear in their instructions.

This 8-hour, two-day course introduces both the promise and pitfalls of ChatGPT’s Advanced Data Analysis mode. The instructor will work with you to understand best practices for using this mode and help you decide under what conditions using it might be a good or bad idea.

Starting August 27, we are offering this seminar as an 8-hour synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two 2-hour lecture sessions that include hands-on exercises, separated by a 30-minute break. You are encouraged to join the lecture live, but you will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously. 

Closed captioning is available for all live and recorded sessions. Captions can be translated to a variety of languages including Spanish, Korean, and Italian.

More Details About the Course Content

This course will combine lectures, demonstrations, and exercises to help participants learn about the capabilities and limitations of Advanced Data Analysis using ChatGPT. Because the exact outputs provided by ChatGPT will change from session to session, participants will receive a copy of the outputs from one “run” of the course prompts. This may be different from what occurs live during the lecture! We will also discuss how this non-repeatability matters for issues of scientific reproducibility and code sharing.

Computing

Advanced Data Analysis is available only to ChatGPT Plus subscribers. Participants in the course who want to use ChatGPT for data analysis must have their own ChatGPT Plus account already set up.

Note that ChatGPT limits the number of prompts users can send during a three-hour period, so users who want to follow along with the course may want to refrain from using ChatGPT for the three hours prior to each session.

Who Should Register?

This course is for anyone who is considering using ChatGPT’s Advanced Data Analysis mode. Those who are already skilled in coding and statistics may get different things out of the course than those with little background, but the course will be valuable for both.

Seminar Outline

1. Quick introduction to large language models (LLMs)

  • How do LLMs work?
  • Why do I get different results every time?
  • What is Advanced Data Analysis mode?
  • Generating R and Python code for local use vs. Advanced Data Analysis mode

2. Loading and describing data

  • Data types (e.g., using multiple Excel sheets)
  • Best practices for data preparation
  • Comparing datasets
  • Missing data
  • Potential problems
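
To make the "Missing data" and "Potential problems" items concrete, here is a minimal sketch of one way an unstated convention can distort a result. It is an illustration only, not material from the course: the tiny dataset, the column name `age`, and the use of -99 as a missing-value code are all assumptions made up for this example.

```python
import csv
import io
import statistics

# Hypothetical survey extract. The -99 entries stand for "missing",
# a convention assumed here for illustration; nothing in the file says so.
RAW_CSV = """respondent,age
1,34
2,-99
3,51
4,29
5,-99
"""

# Assumption: the analyst knows this code and has to supply it explicitly.
MISSING_CODES = {-99}


def read_ages(text):
    """Parse the 'age' column of the CSV text into a list of integers."""
    reader = csv.DictReader(io.StringIO(text))
    return [int(row["age"]) for row in reader]


ages = read_ages(RAW_CSV)

# Naive summary: every value, including the missing-value code, is averaged.
naive_mean = statistics.mean(ages)

# Informed summary: drop the missing-value codes before averaging.
valid_ages = [age for age in ages if age not in MISSING_CODES]
clean_mean = statistics.mean(valid_ages)
n_missing = len(ages) - len(valid_ages)

print(f"Naive mean age: {naive_mean}")
print(f"Mean age after dropping {n_missing} missing codes: {clean_mean}")
```

The naive mean comes out negative, which is obviously wrong here, but a less extreme code could bias a result without being noticed. An assistant that has not been told about the convention has no reason to treat -99 differently from any other number.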

3. Univariate statistics and visualizations

  • Means, SDs, percentiles, frequency tables
  • Histograms
  • Graph styling

4. Bivariate statistics and visualizations

  • T-test, correlation
  • Box plots, scatter plots
  • Refining a plot with verbal instructions

5. Multivariate statistics

  • Data transformation (e.g., logs, quadratics)
  • Regression and regression diagnostics
  • Factor and cluster analysis
  • The importance of custom instructions
  • The risks of automated model parameterization and selection

6. Using output

  • Saving code for replication
  • Downloading visualizations
  • Downloading transformed data
  • Running Python code locally
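
As a rough sketch of why saving code matters for replication: a conversation may not produce the same output twice, but a saved script run locally on the same data does. The example below is an assumption-laden illustration rather than course material; the data, the column names, and the helper functions `transform`, `to_csv`, and `fingerprint` are hypothetical.

```python
import csv
import hashlib
import io
import math

# Hypothetical input data: (id, value) pairs invented for this example.
DATA = [("a", 10.0), ("b", 100.0), ("c", 1000.0)]


def transform(rows):
    """Return the rows with an added base-10 log of the value."""
    return [(name, value, math.log10(value)) for name, value in rows]


def to_csv(rows):
    """Serialize the transformed rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "value", "log10_value"])
    writer.writerows(rows)
    return buffer.getvalue()


def fingerprint(text):
    """Return a SHA-256 hex digest, a compact way to compare two outputs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Run the same saved steps twice, as two separate local runs would.
first_run = to_csv(transform(DATA))
second_run = to_csv(transform(DATA))

print(first_run)
print("Outputs identical:", fingerprint(first_run) == fingerprint(second_run))
```

Sharing the script together with the fingerprint of its output gives collaborators a simple way to confirm that they reproduced the same transformed data.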

7. Conclusion

  • Summary of pros and cons
  • Who should–and shouldn’t–use this

Payment Information

The fee of $695 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.

Seminar Information

Tuesday, August 27, to Wednesday, August 28, 2024

Schedule: All sessions are held live via Zoom. All times are ET (New York time).

10:30am-12:30pm
1:00pm-3:00pm
