Unsupervised Statistical Learning Using Python: A Short Course

An 8-Hour Livestream Seminar Taught by Edwin Dalmaijer, DPhil


Python is a general-purpose programming language. It is open-source, powerful, and easy to use. Because of this, Python is one of the most popular languages in the world, and it has become indispensable in data science.

In this course, we will cover two main classes of analytical approaches that aim to uncover what makes up a dataset by identifying separable dimensions or subgroups. These are unsupervised machine-learning techniques: you give them your data without any outcome labels, and they carve it up into dimensions or subgroups without being told what the right answer is.

Starting August 12, we are offering this seminar as an 8-hour synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two 2-hour lecture sessions which include hands-on exercises, separated by a 30-minute break. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously. 

Closed captioning is available for all live and recorded sessions. Captions can be translated into a variety of languages, including Spanish, Korean, and Italian.

More Details About the Course Content

In the first half of this course, we will cover dimensional analyses. These include principal component analysis, exploratory factor analysis, and independent component analysis. Their objectives are to find the set of (latent) components that gave rise to patterns across variables in your dataset.
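As a rough illustration of what "finding latent components" can look like in practice, here is a minimal sketch of a principal component analysis in scikit-learn. The data are simulated for this example (the sizes, the noise level, and the variable names are all assumptions, not course material): six observed variables are generated from two hidden components, and the PCA should then attribute most of the variance to its first two components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 500 observations of 6 variables driven by 2 latent components.
n_obs, n_latent, n_vars = 500, 2, 6
latent = rng.normal(size=(n_obs, n_latent))      # hidden component scores
loadings = rng.normal(size=(n_latent, n_vars))   # how each component maps onto variables
X = latent @ loadings + 0.1 * rng.normal(size=(n_obs, n_vars))  # add measurement noise

# Fit a PCA with as many components as variables, and project the data onto them.
pca = PCA()
scores = pca.fit_transform(X)

# Proportion of variance per component; these are the values a scree plot would show.
explained = pca.explained_variance_ratio_
print(np.round(explained, 3))
```

Because only two components generated the data, the printed proportions should drop off sharply after the second value.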

In the second half, we will cover subgroup analyses. These are used to find distinct clusters of individuals, and include k-means, fuzzy c-means, and latent class analysis.
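To give a sense of what a subgroup analysis does, the following sketch runs k-means on simulated data with three known groups. The group centres, sample size, and spread are assumptions chosen so that the clusters are clearly separated; real data are rarely this tidy.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: 300 individuals in 3 well-separated groups, measured on 2 variables.
X, true_labels = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

# Ask k-means for 3 clusters; n_init restarts guard against a poor random start.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# The silhouette score (between -1 and 1) summarises how well separated the clusters are.
score = silhouette_score(X, labels)
print(kmeans.cluster_centers_.round(2))
print(round(score, 3))
```

Note that the number of clusters is passed in by the analyst here; choosing it is part of evaluating a cluster solution.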

These methods will be implemented with scikit-learn, a Python package for machine learning. It is a powerful tool for data science, and because its models share a common interface, you will be able to extend what you learn in this course to other models.
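The shared interface can be sketched as follows: three different decomposition models from scikit-learn are fitted with the exact same method call. The choice of the bundled iris dataset and of two components is purely illustrative and not taken from the course.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FactorAnalysis, FastICA
from sklearn.preprocessing import StandardScaler

# Illustrative data: the iris measurements that ship with scikit-learn, standardised.
X = StandardScaler().fit_transform(load_iris().data)

# Three different models, all constructed and used in the same way.
models = {
    "PCA": PCA(n_components=2),
    "Factor analysis": FactorAnalysis(n_components=2, random_state=0),
    "ICA": FastICA(n_components=2, random_state=0),
}

results = {}
for name, model in models.items():
    # fit_transform estimates the model and returns the component scores.
    results[name] = model.fit_transform(X)
    print(name, results[name].shape)
```

Swapping one model for another only requires changing the line that constructs it, which is what makes it easy to carry the same workflow over to other scikit-learn models.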


To run hands-on exercises, we will be using carefully crafted interactive notebooks via Google Colaboratory. For this, you only need an internet browser (like Firefox) and a Google account.

Alternatively, you are welcome to install Python on your own computer. In addition to Python (version 3.7 or higher), you will need the packages NumPy, SciPy, Matplotlib, and scikit-learn. Python package installation can be a bit tricky for those who aren’t familiar with it. We will cover installing Python packages on the first day of the course, so you might want to wait to install anything until then.
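If you do install things locally, one way to check your setup is a short script like the sketch below. It only reports what it finds; the dictionary of package names is written out by hand for this example, and nothing here installs or changes anything.

```python
import importlib
import sys

# Import name -> display name for the packages listed above.
required = {
    "numpy": "NumPy",
    "scipy": "SciPy",
    "matplotlib": "Matplotlib",
    "sklearn": "scikit-learn",
}

# The course asks for Python 3.7 or higher.
python_ok = sys.version_info >= (3, 7)
print("Python", sys.version.split()[0], "OK" if python_ok else "too old")

versions = {}
for module_name, display_name in required.items():
    try:
        module = importlib.import_module(module_name)
        versions[display_name] = getattr(module, "__version__", "unknown")
    except ImportError:
        versions[display_name] = None  # package is not installed

for display_name, version in versions.items():
    print(display_name, version if version is not None else "not installed")
```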

Who Should Register?

This course is aimed at people who already know the basics of Python. This includes those who have taken Code Horizons’ Introduction to Python for Data Analysis.

The content leans towards data science, so this course will be especially useful to those who would like to expand their expertise in data handling, visualization, statistics, and basic machine learning.


Day 1: Dimensional analyses

  • Recap of crucial skills
    • NumPy arrays
    • Loading data from files
  • Generating realistic fake data
    • Creating predictors with realistic covariance
    • Creating outcomes with ground-truth models
  • Principal component analysis (PCA)
    • Fitting a PCA
    • Scree plots
  • Factor analysis
    • PCA as a first step
    • Rotations
  • Independent component analysis (ICA)
    • Uncovering the signals that created an outcome
    • Using ICA in noise reduction
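As a taste of the ICA topic on Day 1, the sketch below mixes two simulated signals and then asks scikit-learn's FastICA to recover them. The signals, the mixing matrix, and the noise level are all assumptions made up for this illustration; it is not the course's own exercise.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Two hypothetical source signals: a sine wave and a square wave, plus a little noise.
n_samples = 2000
t = np.linspace(0, 8, n_samples)
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
S += 0.1 * rng.normal(size=S.shape)

# Each observed channel is a different weighted mix of the two sources.
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])  # assumed mixing matrix
X = S @ A.T

# ICA estimates statistically independent components from the mixtures alone.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)

# ICA cannot recover order, sign, or scale, so compare by absolute correlation.
corr = np.corrcoef(S.T, S_est.T)[:2, 2:]
best_match = np.abs(corr).max(axis=1)
print(np.round(best_match, 3))
```

Each value in `best_match` is the strongest absolute correlation between one true source and any recovered component; values close to 1 indicate that the source was recovered.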

Day 2: Subgroup analyses

  • Cluster analysis
    • k-means
    • The curse of dimensionality
    • Evaluating outcomes
  • Hierarchical agglomerative clustering
    • Distance and linkage
    • Plotting a dendrogram
  • Fuzzy clustering
    • c-means
    • Fuzzy silhouette score
  • Mixture modelling
    • Latent class analysis
    • Latent profile analysis
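The hierarchical clustering topic on Day 2 can be sketched with SciPy, which is among the packages listed above. This example assumes Ward linkage on Euclidean distances and simulated data with three groups; both are illustrative choices. The dendrogram is computed without plotting so that the sketch runs without a display.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical data: 150 individuals in 3 well-separated groups.
X, true_labels = make_blobs(
    n_samples=150,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=1,
)

# Build the merge tree bottom-up: Ward linkage on Euclidean distances.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 clusters remain.
cluster_ids = fcluster(Z, t=3, criterion="maxclust")

# Compute the dendrogram layout only; drop no_plot to draw it with Matplotlib.
tree = dendrogram(Z, no_plot=True)

# Agreement with the simulated ground truth (1.0 means identical groupings).
ari = adjusted_rand_score(true_labels, cluster_ids)
print(Z.shape, len(tree["leaves"]), round(ari, 3))
```

Changing `method` (for example to "average" or "complete") changes how distances between clusters are defined, and therefore the shape of the tree.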

Seminar Information

Monday, August 12 – Tuesday, August 13, 2024

Daily Schedule: All sessions are held live via Zoom. All times are ET (New York time).

10:30am-12:30pm (convert to your local time)

Payment Information

The fee of $695 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.