Text as Data

A 4-Day Remote Seminar Taught by Brandon Stewart, Justin Grimmer, and Molly Roberts

We live in an era of data abundance: never before has so much been so easy to acquire. Social scientists and industry practitioners alike are left with a new problem: how to analyze the now readily accessible mountains of information. A burgeoning array of algorithms, statistical methods, and research designs makes analysis of this information possible. These new forms of data and new statistical techniques provide opportunities to observe behavior that was previously unobservable, to measure quantities of interest that were previously unmeasurable, and to test hypotheses that were previously impossible to test.

This seminar provides an overview of the field of “Text as Data,” with an emphasis on making inferences with social data. The course is organized around the tasks in the research process: discovery, measurement, and inference. We will introduce methods from natural language processing and machine learning (such as clustering, topic modeling, and supervised classification) while demonstrating through applications how they can be used to learn new facts about the social world. Our approach balances teaching you tangible skills now with helping you see the general problems and how to apply the many new tools, some still under development, to text problems. We will introduce specific examples with R code that will enable you to apply these tools to your own problems after the course is over. In each case, we will provide a framework so that you know what new tools are trying to accomplish and how you can use them in your work.

Starting August 3, we are offering this seminar as a 4-day synchronous*, remote workshop. Each day will consist of a 3-hour live lecture held via the free video-conferencing software Zoom. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

Each lecture session will conclude with a hands-on exercise reviewing the content covered, to be completed on your own. Additional lab sessions will be held on Tuesday and Thursday afternoons, where you can review the exercise results with the instructors and ask questions.

*We understand that scheduling is difficult during this unpredictable time. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.

Closed captioning is available for all live and recorded sessions.


This seminar will use R for the demonstrations. To participate in the hands-on exercises, you are strongly encouraged to use a computer with the most recent versions of R and RStudio installed. Basic knowledge of R is required to follow these demonstrations. This includes familiarity with matrices, vectors, lists, and data frames; basic data processing skills (e.g., cleaning, merging, or reshaping data in R); and beginner-level programming knowledge (e.g., functions and loops).

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.

Who Should Register

This course is primarily designed for researchers and practitioners who have limited or no prior experience collecting or analyzing text data using automated methods, but who have basic familiarity with R.

While this seminar will include coding exercises to demonstrate how the tools can be used in practice, the majority of the time will be devoted to how to use these tools to make inferences in either a research or industry setting. Coding demonstrations will be straightforward and self-contained, primarily geared towards allowing participants to explore the methods. For that reason, even those with very limited coding experience can enjoy and benefit from the course. The course can also benefit those who have prior experience with text tools but want a firmer foundation for how to think about text as evidence.

Many of the methods in contemporary text analysis involve statistical models. A basic statistical foundation (probability, linear regression) will help, but the course will always provide the core intuition for users who don’t have that background. (Just know that we might still ask you to look at some equations!)

Seminar Outline

Day 1:

1. Introductions and Principles

  1. What Text Methods Can Do
  2. Core Concepts and Principles
    • An inductive research process
    • Validation for social science purposes
  3. Example Applications
    • Chinese social media
    • Congressional press releases
    • Survey experiments

2. Representing Text as Data

  1. A Basic “Bag of Words” Recipe
  2. The Multinomial Language Model
  3. The Vector Space Model and Distance Metrics
  4. Word Embeddings

Day 2:

3. Discovery

  1. Separating Words
  2. Clustering
  3. Topic Models
  4. Document Embeddings

Day 3:

4. Measurement

  1. Dictionary Methods
  2. Supervised Classification
  3. Assessing Performance

Day 4:

5. Repurposing Discovery for Measurement

  1. Topic Models
  2. Word Embeddings

6. A Brief Introduction to Causal Inference

  1. Text as Outcome
  2. Text as Treatment
  3. Text as Confounder

Seminar Information

Tuesday, August 3, 2021 – Friday, August 6, 2021

Each day will follow this schedule:

11:00am-2:00pm ET (New York time): Live lecture via Zoom

4:00pm-5:00pm ET: Live lab session via Zoom (Tuesday and Thursday only)

Payment Information

The fee of $895 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.