Apache Spark? If Only It Worked

Explore common challenges and optimization techniques for Apache Spark in this 31-minute conference talk from Devoxx. Gain insights into dealing with skewed data, understanding Spark on YARN and its memory model, effective caching strategies, sizing executors, and achieving data locality. Learn from real-world examples and practical solutions to improve performance and stability in Spark applications. Discover a framework for troubleshooting and optimizing Spark jobs, covering topics such as RDD evaluation, execution plans, and debugging tools. Benefit from the speaker's extensive experience working with data infrastructure at companies like VRBO, Spotify, TrueCaller, and Apple.

Syllabus

Introduction
My experience with Spark
Outline of the talk
What is Spark
RDD
Pipelines
Execution Unit
Executor
Executor size
Small executors
Spark memory model
Memory overhead
Shuffle
In practice
Spark UI
Execution Plan
Skewed Data
Locality
Check locality
RDD lazily evaluated
RDD calculated twice
Spark caching
Spark optimization
Map volumes
Improve shuffle
Recap
Debugging tools
Challenge
Use Case
Summary
Questions