Overview

In “Applied Unsupervised Learning in Python,” you will learn how to use algorithms to find interesting structure in datasets. You will practice applying, interpreting, and refining unsupervised machine learning models to solve a diverse set of problems on real-world datasets. This course will show you how to explore unlabelled data using several techniques: dimensionality reduction and manifold learning for condensing and visualizing high-dimensional data, clustering to reveal interesting groups and outliers, topic modeling for summarizing important themes in text, methods for dealing with missing data, and more. The course also covers best practices associated with each technique and demonstrates how unsupervised learning can be used to improve supervised prediction. This is the second course in “More Applied Data Science with Python,” a four-course series focused on helping you apply advanced data science techniques using Python. We recommend that all learners complete the Applied Data Science with Python specialization before beginning this course.

Syllabus

Basic Unsupervised Learning Methods

Welcome to Module 1! In this module, we will learn the basic unsupervised learning methods that focus on transformation of data: dimensionality reduction, manifold learning, and density estimation. We will analyze realistic datasets using the scikit-learn library. We will use manifold learning methods such as t-SNE to visualize complex structure, and use kernel density estimation to estimate probabilities of conditional events. At the end of this module, our assignment is to apply Principal Components Analysis to gain insight into a large real-world dataset. Let’s begin!
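To give a flavor of the dimensionality reduction covered in this module, here is a minimal sketch of Principal Components Analysis with scikit-learn. The synthetic dataset, the variable names, and the choice of two components are illustrative assumptions, not material taken from the course assignment.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: a synthetic stand-in for a real dataset. 200 points in 5
# dimensions that mostly vary along 2 hidden directions, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Project the data onto its top two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance retained by the two components.
explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(float(explained), 3))
```

Because the synthetic data was built from two hidden directions, two components should retain almost all of the variance; on real data, the explained variance ratio is one way to decide how many components to keep.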

Clustering

Welcome to Module 2! In this module, we will learn about clustering, another critical and widely used unsupervised learning method. We will learn about the most important families of clustering algorithms: hierarchical methods (agglomerative bottom-up, divisive top-down), partitioning methods (k-means, k-medoids), and density-based methods (DBSCAN). We will also learn how to evaluate and optimize cluster quality. At the end of this module, our assignment is to apply a variety of these clustering approaches to realistic datasets using scikit-learn's clustering capabilities. Let’s begin!
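As a small preview of a partitioning method and of evaluating cluster quality, the sketch below runs k-means on synthetic blobs and scores the result with the silhouette coefficient. The dataset, the choice of three clusters, and the use of the silhouette score as the quality measure are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic 2-D data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Partition the points into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.
score = silhouette_score(X, labels)
print(len(set(labels)), round(float(score), 3))
```

Repeating this for several values of `n_clusters` and comparing the scores is one common, though not the only, way to choose the number of clusters.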

Unsupervised Methods for Text Analysis

Welcome to Module 3! In this module, we will learn about estimating latent variables, another important area of unsupervised learning, especially for text-based applications. We will focus first on text representations. We will then turn to topic modeling, another form of latent variable estimation, which we will learn about via two different methods: Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). We will also survey word embeddings to learn how to represent words with vectors in semantically useful ways. At the end of this module, our assignment is to solve problems by analyzing topic structure in a large document collection and applying word embeddings to an NLP-related task. Let’s begin!

Applications and Variants of Unsupervised Learning

Welcome to Module 4, our last module of the course! We wrap up our course by learning how unsupervised methods can be integrated with supervised learning methods to improve prediction performance. A key topic in that direction is imputation methods for dealing with missing data. We will also look at various special topics, including extensions of unsupervised learning that are used at the cutting edge of today's technology: semi-supervised learning and self-supervised learning. At the end of this module, our assignment is to apply techniques for imputing missing data and for semi-supervised learning, with the underlying theme being how unsupervised learning can improve supervised learning. Let’s begin!
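As a simple illustration of imputation, the sketch below fills missing values with scikit-learn's `SimpleImputer` (column means) and `KNNImputer` (values from the nearest rows). The tiny array and the choice of these two imputers are assumptions for illustration; the course may use other methods.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Assumption: a toy feature matrix with missing entries marked as NaN.
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [7.0, 6.0],
    [4.0, np.nan],
])

# Mean imputation: replace each NaN with the mean of its column's observed values.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# k-nearest-neighbors imputation: replace each NaN with the average of that
# feature over the 2 rows closest in the features that are observed.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean)
print(X_knn)
```

Either imputer can be placed as the first step of a scikit-learn pipeline, so that a supervised model downstream receives complete data.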