Taming Data Skew in Production ML Pipelines

Learn to identify, understand, and resolve data skew in production machine learning pipelines in this 24-minute conference talk from Conf42 ML 2026. See how data skew shows up in Apache Spark through symptoms such as stragglers, shuffle spills, and idle executors, with real-world examples from Roku's large-scale operations. Explore what data skew is and why it breaks parallelism, working through scenarios that involve power users, hot keys, and power-law distributed data. Understand where skew affects ML pipelines across recommendation systems, classification tasks, and computer vision applications. Examine three primary root causes: natural imbalance from real-world events, join-key and aggregation skew during feature engineering, and computational skew from NLP processing, embeddings, and heavy transformations. Then work through three mitigation strategies: repartitioning and its limits, key salting to split hot keys for significant runtime improvements, and broadcast joins to avoid massive shuffles. The talk closes with guidance on choosing the right fix for a given scenario and a look at emerging AI approaches for predicting skew before it reaches production.

Syllabus

Intro: Handling Data Skew in Production ML Pipelines (Roku)
Roku Scale & the Mystery of Suddenly Slower Spark Jobs
How Skew Shows Up in Spark: Stragglers, Shuffle Spills, Idle Executors
What Data Skew Really Is and Why Parallelism Breaks
Real-World Example: Power Users, Hot Keys, and Power-Law Data
Why It Matters: Technical Bottlenecks + Business Cost Blowups
Where Skew Hits ML Pipelines: Recs, Classification, Computer Vision
Root Causes of Skew #1: Natural Imbalance from Real-World Events
Root Causes of Skew #2: Join-Key & Aggregation Skew in Feature Engineering
Root Causes of Skew #3: Computational Skew (NLP, Embeddings, Heavy Transforms)
Mitigation Step 1: Repartitioning—When It Works and Its Limits
Mitigation Step 2: Key Salting to Split Hot Keys (Big Runtime Wins)
Mitigation Step 3: Broadcast Joins to Avoid Massive Shuffles
Wrap-Up: Choosing the Right Fix + AI to Predict Skew Before It Happens
Closing & How to Connect
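
Illustrative sketch: Key Salting

The key salting step named in the syllabus can be illustrated without a cluster. The sketch below is not taken from the talk: it is a minimal plain-Python simulation, assuming a toy byte-sum hash partitioner, a hypothetical hot key called power_user, and a round-robin salt chosen for reproducibility (real jobs usually draw a random salt and rely on the engine's own hash). It shows how one hot key overloads a single partition, how appending a salt spreads its rows, and how a second aggregation stage strips the salt to recover the true per-key totals.

```python
from collections import Counter


def toy_partition(key: str, num_partitions: int) -> int:
    """Toy deterministic partitioner (byte sum); real engines use stronger hashes."""
    return sum(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition."""
    counts = Counter(toy_partition(k, num_partitions) for k in keys)
    return [counts[p] for p in range(num_partitions)]


def salt_keys(keys, hot_keys, num_salts):
    """Append a round-robin salt suffix to hot keys only; other keys pass through."""
    salted, seen = [], 0
    for key in keys:
        if key in hot_keys:
            salted.append(f"{key}#{seen % num_salts}")
            seen += 1
        else:
            salted.append(key)
    return salted


def unsalted_counts(salted_keys):
    """Two-stage aggregation: count per salted key, then strip the salt and merge."""
    partial = Counter(salted_keys)          # stage 1: partial counts per salted key
    merged = Counter()
    for key, n in partial.items():          # stage 2: combine partials per original key
        merged[key.split("#", 1)[0]] += n
    return merged


# Hypothetical power-law style data: one hot key plus a long tail of light users.
keys = ["power_user"] * 9000 + [f"user_{i}" for i in range(1000)]

before = partition_sizes(keys, 8)
salted = salt_keys(keys, {"power_user"}, 8)
after = partition_sizes(salted, 8)
totals = unsalted_counts(salted)

print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
print("rows for power_user after merging:", totals["power_user"])
```

The trade-off is the extra merge stage: salting only pays off when the hot key dominates the runtime, which is why salting just the known hot keys, as here, is a common choice over salting every key.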