Prepare for Disruptions - How We Upgrade the Whole ML Training Fleet Bi-weekly

In this conference talk from KubeCon + CloudNativeCon, learn how to build disruption-tolerant machine learning infrastructure on Kubernetes that maintains job continuity during both planned and unplanned node interruptions. Discover battle-tested techniques developed at LinkedIn scale for machine learning workloads, which are particularly vulnerable to disruptions such as host maintenance, kernel upgrades, security patches, GPU ECC memory errors, and sudden node failures. Explore automatic multi-stage checkpoint and restore mechanisms that let training jobs recover quickly and seamlessly after an interruption. Master intelligent scheduling and smart collocation strategies that account for node health, job characteristics, and maintenance timing to minimize the impact of disruptions. Understand job-aware backpressure mechanisms that coordinate infrastructure updates and make disruption less likely during critical phases of training. Gain practical strategies for managing infrastructure disruptions while balancing platform reliability with job continuity, using Kubernetes capabilities to protect valuable training time and keep machine learning workflows from being derailed.
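
As a rough illustration of the checkpoint-and-restore idea mentioned above, the following is a minimal single-stage sketch, not the multi-stage mechanism presented in the talk. It assumes the job is told to stop with SIGTERM (which Kubernetes sends to a container before a pod is terminated) and that a tiny JSON file is enough to hold the training state. The class name `CheckpointingTrainer`, the helper `run_with_sigterm_handling`, the checkpoint layout, and the placeholder training step are all hypothetical.

```python
import json
import os
import signal
import tempfile


class CheckpointingTrainer:
    """Hypothetical training loop that resumes from its last checkpoint."""

    def __init__(self, path, total_steps):
        self.path = path                # where the checkpoint file lives
        self.total_steps = total_steps
        self.step = 0
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signal-handler compatible: only set a flag, do the real work
        # in the training loop where it is safe to write files.
        self.stop_requested = True

    def save(self):
        # Write to a temporary file, then rename atomically so an
        # interruption mid-write never leaves a corrupt checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"step": self.step}, f)
        os.replace(tmp_path, self.path)

    def restore(self):
        # Resume from the saved step if a checkpoint exists.
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.step = json.load(f)["step"]

    def run(self, checkpoint_every=100):
        self.restore()
        while self.step < self.total_steps and not self.stop_requested:
            self.step += 1  # placeholder for one real training step
            if self.step % checkpoint_every == 0:
                self.save()
        self.save()  # final checkpoint, also taken when asked to stop
        return self.step


def run_with_sigterm_handling(path, total_steps):
    # Assumed wiring: treat SIGTERM as "checkpoint and exit cleanly".
    trainer = CheckpointingTrainer(path, total_steps)
    signal.signal(signal.SIGTERM, trainer.request_stop)
    return trainer.run()
```

In this sketch, a restarted job that points at the same checkpoint path picks up at the saved step instead of starting from zero. A real training job would save model and optimizer state rather than a step counter, and would need to finish its checkpoint within the pod's termination grace period.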