Overview
Learn how Meta engineers deployed sched_ext, a Linux kernel feature that lets scheduling policies be implemented as BPF programs loaded from user space, to accelerate AI training across tens of thousands of GPUs in their Reality Labs fleet. Discover why GPU training workloads are hard to schedule: they synchronize frequently, so they are extremely sensitive to micro-delays that hold up work dispatch to GPUs. Explore how multi-socket CPU systems with attached Nvidia GPUs run concurrent processes such as data loading, preprocessing, and model checkpointing, all of which compete for CPU time and create scheduling congestion. Understand the implementation of the scx_layered scheduler and the methodology used to deploy it across large-scale infrastructure. Examine how latency-critical system tasks were identified and how resource isolation strategies were developed around them. Follow the debugging approaches used for corner cases and the fleet-wide performance monitoring techniques. Analyze the results achieved, including a 9% improvement in GPU compute unit utilization for certain model types and an overall reduction in fleet training costs.
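To see why micro-delays in work dispatch matter so much, consider that synchronous data-parallel training ends each step with a collective operation, so every step lasts as long as its slowest worker. The toy Python sketch below (not Meta's code; `step_time` and all numbers are illustrative assumptions) shows how a 2 ms preemption of one worker's dispatch thread stalls all GPUs in the job:

```python
# Illustrative sketch: sensitivity of synchronous training to micro-delays.
# Each step ends with a sync (e.g. all-reduce), so step duration is
# governed by the slowest worker's dispatch delay. All values are
# hypothetical, chosen only to show the shape of the effect.

def step_time(dispatch_delays_us, compute_us=10_000):
    """Step duration when all workers must synchronize at the end."""
    return compute_us + max(dispatch_delays_us)

# 8 workers, ideal case: GPU work is dispatched with no scheduling delay.
ideal = step_time([0] * 8)

# One worker's dispatch thread is preempted for 2 ms by a background
# task (data loading, checkpointing, ...). All 8 GPUs wait for it.
delayed = step_time([0] * 7 + [2_000])

print(f"ideal step:   {ideal} us")
print(f"delayed step: {delayed} us")
print(f"utilization lost to one straggler: {1 - ideal / delayed:.1%}")
```

This is the congestion problem that isolating latency-critical tasks with a scheduler like scx_layered is meant to address: keeping background work off the CPUs that feed the GPUs bounds the worst-case dispatch delay.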
Syllabus
Accelerating AI training fleets with sched_ext - Patrick Lu, Valentin Andrei, Pat Somaru
Taught by
Linux Plumbers Conference