Spark Right-Sizing - Saving Thousands of PBHrs of Compute at LinkedIn

Learn how LinkedIn developed an automated Spark executor memory right-sizing system to optimize resource allocation across more than 400,000 daily Spark applications consuming 200+ petabyte-hours of compute. Discover how manual Spark memory configuration led to low utilization and frequent out-of-memory (OOM) errors, and explore the policy-based solution, which combines nearline and real-time feedback loops. Examine how historical data analysis and real-time error classification enable dynamic memory adjustments that significantly reduce the gap between allocated and utilized resources while improving job reliability. Understand the technical architecture and implementation details of this automated tuning system, which achieved a 13% increase in memory utilization, a 90% reduction in OOM-related job failures, and annual savings of thousands of petabyte-hours of compute. Gain insights into scaling Spark optimization techniques for large enterprise environments, and learn practical approaches to improving both resource efficiency and user productivity in big data processing workflows.
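
To make the idea of two feedback loops concrete, here is a minimal Python sketch of what a policy-based right-sizer could look like. It is not LinkedIn's implementation: the `Policy` fields, the 95th-percentile rule, the headroom and boost factors, the memory bounds, and the error-message markers are all assumptions chosen for illustration. The only real Spark name used is the `spark.executor.memory` setting.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Policy:
    """Hypothetical tuning policy; every value here is an illustrative assumption."""
    percentile: int = 95      # which percentile of historical peak usage to target
    headroom_pct: int = 120   # allocate 120% of that percentile
    oom_boost_pct: int = 150  # on an OOM failure, retry with 150% of current memory
    min_mb: int = 1024        # lower bound on executor memory
    max_mb: int = 16384       # upper bound on executor memory


def _scale(value_mb: int, pct: int) -> int:
    """Multiply by pct/100 and round up, using integer arithmetic only."""
    return (value_mb * pct + 99) // 100


def _clamp(value_mb: int, policy: Policy) -> int:
    return max(policy.min_mb, min(policy.max_mb, value_mb))


def nearline_recommendation(peaks_mb: Sequence[int], policy: Policy) -> Optional[int]:
    """Nearline loop: derive executor memory from peak usage of past runs.

    Returns None when there is no history, so the caller keeps the user's setting.
    """
    if not peaks_mb:
        return None
    ordered = sorted(peaks_mb)
    # Nearest-rank percentile: rank = ceil(percentile * n / 100), at least 1.
    rank = max(1, (policy.percentile * len(ordered) + 99) // 100)
    return _clamp(_scale(ordered[rank - 1], policy.headroom_pct), policy)


# Illustrative substrings only; a production classifier would need a richer rule set.
OOM_MARKERS = ("java.lang.OutOfMemoryError", "exceeding memory limits")


def classify_failure(message: str) -> str:
    """Real-time loop, step 1: label a failure as 'oom' or 'other'."""
    return "oom" if any(marker in message for marker in OOM_MARKERS) else "other"


def realtime_adjustment(current_mb: int, message: str, policy: Policy) -> int:
    """Real-time loop, step 2: raise memory for the retry only on OOM failures."""
    if classify_failure(message) != "oom":
        return current_mb
    return _clamp(_scale(current_mb, policy.oom_boost_pct), policy)


def to_spark_conf(memory_mb: int) -> Dict[str, str]:
    """Express the decision as a Spark setting (memory size in mebibytes)."""
    return {"spark.executor.memory": f"{memory_mb}m"}


if __name__ == "__main__":
    policy = Policy()
    history = [2000, 2200, 2100, 2500, 2300]  # made-up peak usage per run, in MB
    recommended = nearline_recommendation(history, policy)
    print(to_spark_conf(recommended))
    print(realtime_adjustment(recommended, "java.lang.OutOfMemoryError: Java heap space", policy))
```

In this sketch the nearline loop shrinks over-provisioned jobs toward their observed usage, while the real-time loop handles the opposite case by increasing memory only when a failure is classified as memory-related, so unrelated errors do not inflate allocations.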