Introduction to Text as Data: A Short Course

A 3-Day Livestream Seminar Taught by Amber Boydstun, Ph.D. and Cory Struthers, Ph.D.


Text is all around us: from archived court documents to this morning’s social media posts, from transcripts of political ads to terrorist manifestos. Text-as-data methods allow us to use this text to measure and discover phenomena that may be otherwise hard or impossible to represent quantitatively, such as ideological positions of court documents and emotional sentiment in manifestos.

There has never been a more exciting time to learn text-as-data methods. Digital advances have made available text content that even a few years ago would have been difficult to collect, and computational text-as-data methods have advanced just as fast. However, because there is now a vast amount of text data to explore and a dizzying array of accessible text-as-data tools to apply, understanding which methods are appropriate for which contexts is critically important.

This course will provide an introduction to text-as-data methods, including how they work, how they can be applied, and common pitfalls to avoid. We will focus on linking concepts to measurement through textual data. Topics covered include: manual content analysis; text collection and pre-processing; advanced keyword queries and frequencies; dictionary analysis (including sentiment analysis); text similarity and reuse; topic modeling; and supervised machine learning.

Starting January 25, we are offering this seminar as a 3-day synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two lecture sessions that include hands-on exercises, separated by a 1-hour break. You are encouraged to join the lectures live, but you will have the opportunity to view the recorded sessions later in the day if you are unable to attend at the scheduled time.

*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.

Closed captioning is available for all live and recorded sessions. Live captions can be translated to a variety of languages including Spanish, Korean, and Italian.

Computing

This seminar will primarily use R and RStudio for in-class examples and exercises. R is a free, open-source programming language, and RStudio is a free development environment for working with it. Both should be installed before the course begins. Basic literacy in R is needed to get the most out of the course.

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics.

Who Should Register?

This course is designed for anyone who wants to apply text-as-data methods to newspapers, legislation, social media, meeting minutes, and other documents. No previous background in text-as-data or statistical methods is necessary. However, a working understanding of R is essential.

Outline

Day 1:

Introduction and overview
What is our goal? Defining latent variables of interest
The gold standard: Manual content analysis
Pre-processing text data
Approaches to measuring word frequencies

Day 2:

Understanding dictionary methods
Using established dictionaries; considerations for generating your own
Sentiment analysis
Topical dictionaries and related types

Day 3:

Text similarity and reuse
Different approaches to text similarity, including cosine similarity
Topic modeling and validation
Resources for pursuing more advanced topics

Reviews of Introduction to Text as Data

“I really enjoyed the open conversations.”
Sukumar Ganapati, Florida International University

“I loved the balance between the theory, background, and methods. I enjoyed trying to then apply it.”
Anandi Hira, Carnegie Mellon University

Seminar Information

Thursday, January 25 – Saturday, January 27, 2024

Daily Schedule: All sessions are held live via Zoom. All times are ET (New York time).

10:00am-12:30pm
1:30pm-3:30pm

Payment Information

The fee of $995 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.

Contact Information

+1 610-715-0115 info@statisticalhorizons.com