Machine Learning with Imbalanced Data

via Train in Data

Details

Provider: Train in Data
Pricing: Paid Course
Languages: English
Certificate: Certificate Available
Sessions: Self-Paced
Level: Advanced

Overview

The most comprehensive online course on machine learning with imbalanced data. Learn about under-sampling, over-sampling, SMOTE and much more.

Discover the truth about SMOTE and other resampling methods.

If you're disappointed for whatever reason, you'll get a full refund.

Sole is a lead data scientist, instructor and developer of open source software. She created and maintains Feature-engine, a Python library for feature engineering that can be used to impute missing data, encode categorical variables, and transform, create and select features. Sole is also the author of the book "Python Feature Engineering Cookbook", published by Packt.

Welcome to Machine Learning with Imbalanced Datasets. In this course, you will learn multiple techniques to improve the performance of machine learning models trained with imbalanced datasets.

Imbalanced datasets typically arise in classification problems where one of the target classes is severely under-represented. When this happens, we talk about class imbalance. The class with a small number of samples is called the minority class, and the class or classes with plenty of data are called the majority class or classes.

Imbalanced datasets are common in data science. Typical examples are the datasets used for fraud detection or medical diagnosis.

Most machine learning algorithms implicitly assume roughly balanced class distributions. As a result, classifiers trained on imbalanced data tend to be biased towards the majority class.
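A minimal sketch of why a majority-biased model can still look good on paper. The 990/10 class split and the always-predict-majority "classifier" below are assumptions made up for illustration, not data or code from the course.

```python
# Hypothetical toy labels: 990 majority (0) and 10 minority (1) samples.
y_true = [0] * 990 + [1] * 10

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of minority samples that were detected.
minority_total = sum(t == 1 for t in y_true)
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / minority_total

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Here the accuracy is 0.99 while the minority recall is 0.00: the model never detects a single minority sample, yet accuracy alone would suggest it performs well.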

In addition, because the minority class has few samples, rules that accurately predict it are hard to find. Thus, observations belonging to the minority class are the ones most often misclassified by classification models.

Fortunately, there are various ways in which we can improve the performance of classifiers trained on data with imbalanced classes, including resampling, cost-sensitive learning, and ensemble methods.
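The first two approaches can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the course's code: the course material lists the imbalanced-learn package, which is not used here, and the helper names (`group_by_class`, `random_undersample`, `random_oversample`, `balanced_class_weights`) and the toy data are hypothetical. The class-weight formula, n_samples / (n_classes * class_count), is the heuristic scikit-learn applies for `class_weight="balanced"`. Ensemble methods are not illustrated.

```python
import random
from collections import Counter, defaultdict


def group_by_class(X, y):
    """Map each class label to the list of its feature rows."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    X_res, y_res = [], []
    for label, rows in groups.items():
        X_res.extend(rng.sample(rows, n_min))
        y_res.extend([label] * n_min)
    return X_res, y_res


def random_oversample(X, y, seed=0):
    """Duplicate random rows of each class until it matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    X_res, y_res = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        X_res.extend(rows + extra)
        y_res.extend([label] * n_max)
    return X_res, y_res


def balanced_class_weights(y):
    """Cost-sensitive weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * n) for label, n in counts.items()}


# Hypothetical toy data: 90 majority (0) and 10 minority (1) samples.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
weights = balanced_class_weights(y)

print(Counter(y_under))
print(Counter(y_over))
print(weights)
```

With this toy data, each class ends up with 10 samples after under-sampling and 90 after over-sampling, and the minority class weight (5.0) is nine times the majority class weight (about 0.56). In practice, resampling is applied to the training data only, so that the test set keeps the original class distribution.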

In this course, you will learn multiple methods to improve the performance of machine learning models trained on imbalanced data and decrease the misclassification of the minority class or classes.

The course is divided into the following sections:

You will learn suitable metrics to assess classification models trained on imbalanced datasets. You will learn about the ROC curve and the ROC-AUC. You will create a confusion matrix, find true positives, true negatives, false positives, and false negatives, and then use them to calculate other metrics like precision, recall, and the F1-score. You will also learn about performance metrics designed specifically for imbalanced classification, like the balanced accuracy and the index of imbalanced accuracy, among others.
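The confusion-matrix metrics mentioned above can be sketched as follows. The labels and predictions are made-up toy values, and `confusion_counts` is a hypothetical helper written for this illustration; in practice a library such as scikit-learn provides equivalent functions.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


# Hypothetical toy data: 10 minority (1) and 90 majority (0) samples.
y_true = [1] * 10 + [0] * 90
# The model detects 6 of the 10 positives and raises 9 false alarms.
y_pred = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 81

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)    # how many predicted positives are real
recall = tp / (tp + fn)       # how many real positives are detected
specificity = tn / (tn + fp)  # how many real negatives are detected
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2
g_mean = sqrt(recall * specificity)

print(f"accuracy:          {accuracy:.2f}")
print(f"precision:         {precision:.2f}")
print(f"recall:            {recall:.2f}")
print(f"f1:                {f1:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
print(f"g-mean:            {g_mean:.2f}")
```

On this toy data the accuracy is 0.87, while precision (0.40), recall (0.60), F1 (0.48) and balanced accuracy (0.75) give a less flattering and more informative picture of how the minority class is handled.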

Syllabus

Welcome
- Introduction
- Course Curriculum Overview
- Working with imbalanced data in 2024
- Additional resources
Course material
- Course Material
- Code | Jupyter notebooks
- Presentations covered in the course
- Download Datasets
- Python package Imbalanced-learn
- How did you hear about us?
Machine Learning with Imbalanced Data: Overview
- Introduction to imbalanced data
- Nature of the imbalanced class
- Approaches to work with imbalanced datasets
- Reading Resources
- Refer a friend program
- Quiz
Evaluation Metrics
- Introduction to Performance Metrics
- Accuracy
- Accuracy - Demo
- Precision, Recall and F-measure
- Install Yellowbrick
- Precision, Recall and F-measure - Demo
- Confusion tables, FPR and FNR
- Confusion tables, FPR and FNR - Demo
- Balanced Accuracy
- Balanced accuracy - Demo
- Geometric Mean, Dominance, Index of Imbalanced Accuracy
- Geometric Mean, Dominance, Index of Imbalanced Accuracy - Demo
- ROC-AUC
- ROC-AUC - Demo
- Precision-Recall Curve
- Precision-Recall Curve - Demo
- Additional reading resources
- Probability
- Tuning the probability threshold with sklearn
- Bringing it all together - credit risk
- Quiz - binary classification
- Metrics for Multiclass
- Metrics for Multiclass - Demo
- PR and ROC Curves for Multiclass
- PR Curves in Multiclass - Demo
- ROC Curve in Multiclass - Demo
- Quiz - multiclass classification
- How are we doing?
Cost Sensitive Learning
- Cost-sensitive Learning
- Types of Cost
- Obtaining the Cost
- Cost Sensitive Approaches
- Misclassification Cost in Logistic Regression
- Misclassification Cost in Decision Trees
- Cost Sensitive Learning with Scikit-learn
- Find Optimal Cost with hyperparameter tuning
- Cost sensitive learning - credit risk
- Cost sensitive learning in ensemble methods
- CSL: before or after feature engineering?
- Cost-sensitive pipelines
- Quiz - cost sensitive learning
- Bayes Conditional Risk
- MetaCost
- MetaCost - Demo
- Optional: MetaCost Base Code
- Wrapping up
- How are we doing?
- Additional Reading Resources
Undersampling
- Under-Sampling Methods - Introduction
- Random Under-Sampling - Intro
- Random Under-Sampling - Demo
- Condensed Nearest Neighbours - Intro
- Condensed Nearest Neighbours - Demo
- Tomek Links - Intro
- Tomek Links - Demo
- One Sided Selection - Intro
- One Sided Selection - Demo
- Edited Nearest Neighbours - Intro
- Edited Nearest Neighbours - Demo
- Repeated Edited Nearest Neighbours - Intro
- Repeated Edited Nearest Neighbours - Demo
- All KNN - Intro
- All KNN - Demo
- Neighbourhood Cleaning Rule - Intro
- Neighbourhood Cleaning Rule - Demo
- NearMiss - Intro
- NearMiss - Demo
- Instance Hardness Threshold - Intro
- Instance Hardness Threshold - Demo
- Instance Hardness Threshold Multiclass Demo
- Undersampling Method Comparison
- Quiz - undersampling comparison
- Setting up a classifier with under-sampling and cross-validation
- Quiz - comparison with cross-validation
- Undersampling methods comparison with hyperparameter tuning
- Wrapping up the section
- How are we doing?
- Summary Table
- Added Treat: A Movie We Recommend 🍿
Oversampling
- Over-Sampling Methods - Introduction
- Random Over-Sampling
- Random Over-Sampling - Demo
- ROS with smoothing - Intro
- ROS with smoothing - Demo
- SMOTE
- SMOTE - Demo
- SMOTE-NC
- SMOTE-NC - Demo
- SMOTE-N
- SMOTE-N Demo
- ADASYN
- ADASYN - Demo
- Borderline SMOTE
- Borderline SMOTE - Demo
- SVM SMOTE
- Resources on SVMs
- SVM SMOTE - Demo
- K-Means SMOTE
- K-Means SMOTE - Demo
- Over-Sampling Method Comparison
- Quiz - oversampling methods comparison
- Oversampling method comparison - take 2
- Wrapping up the section
- SMOTE in 2024
- How to Correctly Set Up a Classifier with Over-sampling
- Summary Table
- Extra Treat: Our Reading Suggestion 📕
Over and Undersampling
- Combining Over and Under-sampling
- SMOTE + ENN and SMOTE + Tomek Links - Demo
- Comparison of Over and Under-sampling Methods
- Combine over and under-sampling manually
- Wrapping up
Ensemble Methods
- Ensemble methods with Imbalanced Data
- Foundations of Ensemble Learning
- Bagging
- Bagging with over- or undersampling
- Boosting
- Boosting with resampling
- Hybrid Methods
- Ensemble Methods - Demo
- Comparison of ensemble methods
- Wrapping up
- Additional Reading Resources
- More Wisdom: Our Chosen Podcast Episode 🎧
Probability Calibration
- Probability Calibration
- Probability Calibration Curves
- Probability Calibration Curves - Demo
- Brier Score
- Brier Score - Demo
- Under- and Over-sampling and Cost-sensitive learning on Probability Calibration
- Calibrating a Classifier
- Calibrating a Classifier - Demo
- Calibrating a Classifier after SMOTE or Under-sampling
- Calibrating a Classifier with Cost-sensitive Learning
- Quiz
- Additional reading resources
Wrapping-up
- Wrapping-up
- Assignment
Next steps
- Congratulations
- Next steps