Bluesky Data with R and LLMs: Collection, Cleaning, and Analysis - A Short Course
An 8-Hour Livestream Seminar Taught by Resul Umit, Ph.D.
The emergence of new social media platforms presents fresh opportunities for researchers. Bluesky, in particular, is rapidly gaining a large and active user base while offering free access to its data — making it an invaluable resource for academics and professionals alike.
Yet, leveraging social media data demands specialized skills that are often absent from traditional academic and professional training. This short course bridges that gap by providing you with the essential knowledge and practical tools — including large language models (LLMs) — to effectively work with social media data from Bluesky.
This three-day course provides a thorough, hands-on introduction to collecting, cleaning, and analyzing data from Bluesky using the R programming language, supported by LLMs. Bluesky’s rich data offer immense potential to inspire innovative research or enhance existing projects. Through a carefully structured series of demonstrations and exercises that progressively build skills, you will gain practical expertise in social media data analysis, opening new doors to impactful research, actionable insights, and professional growth.
By the end of this course, you will be able to effectively use R to access and collect data from the Bluesky API, clean and prepare the resulting datasets for analysis, and apply core methods for analyzing social media data. You’ll also be able to use LLMs programmatically to support data cleaning and enhance analytical tasks.
Starting August 26, this seminar will be presented as an 8-hour synchronous livestream workshop via Zoom. Each day will feature two lecture sessions with hands-on exercises, separated by a 30-minute break. Live attendance is recommended for the best experience, but if you can’t join in real time, recordings will be available within 24 hours and can be accessed for four weeks after the seminar.
Closed captioning is available for all live and recorded sessions. Captions can be translated into a variety of languages, including Spanish, Korean, and Italian.
ECTS Equivalent Points: 1
More Details About the Course Content
During the first part of the seminar, you will learn about APIs and how they facilitate data retrieval, focusing on setting up and authenticating access to the Bluesky API. The session will cover techniques for collecting various types of data, including profiles and their posts. Demonstrations and hands-on exercises will involve writing R scripts to retrieve data from Bluesky, ensuring that you gain practical experience from the outset.
The next part of the seminar will focus on cleaning and preparing data for analysis, with an emphasis on the unique characteristics of social media content. The session will introduce key challenges, such as managing hashtags, mentions, emojis, and links, and will cover tidy data principles for efficient data manipulation. Demonstrations will showcase using LLMs as well as R packages designed for processing textual data, illustrating how to clean and structure Bluesky data in preparation for analysis.
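As a small illustration of the kind of cleaning described above, the sketch below uses only base R regular expressions to pull hashtags and mentions out of a post and then strip them, along with links, from the text. The example post, the handle, and the patterns are assumptions made for illustration; they are not taken from the course materials, and real Bluesky text may need more careful rules.

```r
# Hypothetical example post; not real Bluesky data.
post <- "Loving the new #rstats release! Thanks @example.bsky.social https://example.com/news"

# Extract hashtags and mentions (illustrative patterns, not exhaustive).
hashtags <- regmatches(post, gregexpr("#[[:alnum:]_]+", post))[[1]]
mentions <- regmatches(post, gregexpr("@[[:alnum:]._-]+", post))[[1]]

# Remove links first, then hashtags and mentions, then tidy the whitespace.
clean <- gsub("https?://[^[:space:]]+", "", post)
clean <- gsub("[#@][[:alnum:]._-]+", "", clean)
clean <- trimws(gsub("[[:space:]]+", " ", clean))

print(hashtags)
print(mentions)
print(clean)
```

Keeping the extracted hashtags and mentions as separate vectors, rather than simply deleting them, leaves them available for later steps such as network analysis.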
The seminar will then focus on analyzing Bluesky data, drawing on core methods in social media analysis, such as textual analysis with and without LLMs, network analysis, and topic modeling. Demonstrations and exercises will provide insights into different analytical strategies and their applications to social media data.
Computing
You should have R, RStudio, and Ollama installed on your local machine, and a user account on Bluesky to enable the collection of data that requires authentication during the seminar. Detailed setup guidance will be provided in advance.
You should have basic knowledge of the R programming language, such as working with data frames and basic functions. Familiarity with data handling in R is advantageous but not required.
If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.
Who Should Register?
This workshop is ideal for academics interested in social media data, professionals seeking to expand their skill set, and individuals with a basic understanding of R who wish to apply it to real-world data.
Outline
Data collection
- The Bluesky API
- Essential libraries and functions for data access in R
- Demonstrations of data collection
- Hands-on exercises: Writing R scripts to collect data from Bluesky API
Data cleaning
- Fundamentals of working with social media data
- Essential libraries and functions for data cleaning in R
- LLMs for advanced cleaning and structured extraction
- Demonstrations of data cleaning
- Hands-on exercises: Writing R scripts to prepare the collected data for analysis
Data analysis
- Essential libraries and functions for data analysis in R
- LLMs for text and image classification
- Demonstrations of diverse data analysis
  - Descriptive summaries and data visualizations
  - Textual analysis, network analysis, topic modeling
  - Regression analysis across existing and derived variables
- Hands-on exercises: Writing R scripts to analyze cleaned datasets
Seminar Information
Daily Schedule: All sessions are held live via Zoom. All times are ET (New York time).
10:30am-12:30pm (convert to your local time)
1:00pm-3:00pm
Payment Information
The fee of $695 USD includes all course materials.
PayPal and all major credit cards are accepted.
Our Tax ID number is 26-4576270.
