Text as Data: A Short Course

A 3-Day Livestream Seminar Taught by Amber Boydstun, Ph.D. and Cory Struthers, Ph.D.

Text is all around us: from archived court documents to this morning’s social media posts, from transcripts of political ads to terrorist manifestos. Text-as-data methods allow us to use this text to measure and discover phenomena that may be otherwise hard or impossible to represent quantitatively, such as ideological positions of court documents and emotional sentiment in manifestos.

There has never been a more exciting time to learn text-as-data methods. Digital advances have made available text content that even a few years ago would have been difficult to collect, and computational text-as-data methods have advanced just as fast. However, because there is now a vast amount of text data to explore and a dizzying array of accessible text-as-data tools to apply, understanding which methods are appropriate for which contexts is critically important.

This course will provide an introduction to text-as-data methods, including how they work, how they can be applied, and common pitfalls to avoid. We will focus on linking concepts to measurement through textual data. Topics covered include: manual content analysis; text collection and pre-processing; advanced keyword queries and frequencies; dictionary analysis (including sentiment analysis); text similarity and reuse; topic modeling; and supervised machine learning.

Starting April 27, we are offering this seminar as a 3-day synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two lecture sessions which include hands-on exercises, separated by a 1-hour break. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously. 

Closed captioning is available for all live and recorded sessions.

More Details About the Course Content

This seminar provides an intensive introduction to text-as-data methods, drawing on social science research and perspectives. The course will be roughly divided into four parts.

In Part 1, we will give an overview of text-as-data methods, highlighting the range of applications they make possible. We will ground this discussion in classic “manual content analysis” methods, which remain the gold standard for validating computational approaches.

In Part 2, we will navigate the process of collecting, organizing, and pre-processing a text dataset, known as a corpus (plural: corpora).

In Part 3, we will examine core text-as-data techniques for which “off the shelf” code exists: advanced keyword queries and frequencies, dictionary methods (including sentiment analysis), text similarity and reuse, and topic modeling.

In Part 4, we will explore more advanced text-as-data methods that require additional data and/or expertise but that also open up additional avenues of research.

Here are some of the things you will be able to do by the end of this course:

    • Develop a content analysis codebook.
    • Acquire and organize text in R.
    • Pre-process text for analysis.
    • Calculate frequencies of key words or phrases in a corpus.
    • Evaluate the sentiment of a corpus.
    • Apply dictionary methods to a corpus.
    • Identify topics in a corpus.
    • Have the foundational knowledge to learn more about advanced text analysis methods.


This seminar will primarily use R and RStudio for in-class examples and exercises. R is a free, open-source programming language, and RStudio is a free development environment for working with R; both should be installed before the course begins. Previous experience with R is helpful but not needed for this course, as all code will be provided.

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.

Who Should Register?

This course is designed for anyone who wants to apply text-as-data methods to newspapers, legislation, social media, meeting minutes, and other documents. No previous background in text-as-data or statistical methods is necessary. A working understanding of R is helpful but not necessary.


Day 1:

  • Introduction and overview
  • What is our goal? Defining latent variables of interest
  • The gold standard: Manual content analysis
  • Developing a codebook
  • Collecting our data
  • Organizing our data
  • Pre-processing

Day 2:

  • Keyword queries
  • Frequencies
  • Dictionary methods

Day 3:

  • Text similarity and reuse
  • Topic modeling
  • Supervised learning and beyond

Seminar Information

Thursday, April 27 – Saturday, April 29, 2023

Daily Schedule: All sessions are held live via Zoom. All times are ET (New York time).

10:00am-12:30pm

Payment Information

The fee of $995 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.