Overview
Syllabus
Intro: Handling Data Skew in Production ML Pipelines (Roku)
Roku Scale & the Mystery of Suddenly Slower Spark Jobs
How Skew Shows Up in Spark: Stragglers, Shuffle Spills, Idle Executors
What Data Skew Really Is and Why Parallelism Breaks
Real-World Example: Power Users, Hot Keys, and Power-Law Data
Why It Matters: Technical Bottlenecks + Business Cost Blowups
Where Skew Hits ML Pipelines: Recs, Classification, Computer Vision
Root Causes of Skew #1: Natural Imbalance from Real-World Events
Root Causes of Skew #2: Join-Key & Aggregation Skew in Feature Engineering
Root Causes of Skew #3: Computational Skew (NLP, Embeddings, Heavy Transforms)
Mitigation Step 1: Repartitioning—When It Works and Its Limits
Mitigation Step 2: Key Salting to Split Hot Keys (Big Runtime Wins)
Mitigation Step 3: Broadcast Joins to Avoid Massive Shuffles
Wrap-Up: Choosing the Right Fix + AI to Predict Skew Before It Happens
Closing & How to Connect
Taught by
Conf42