Accelerating AI Training Fleets with sched_ext

Linux Plumbers Conference via YouTube

Overview

Learn how Meta engineers deployed sched_ext, a Linux kernel framework for implementing CPU schedulers as BPF programs, to accelerate AI training across tens of thousands of GPUs in their Reality Labs fleet. Discover the challenges of scheduling GPU training workloads that synchronize frequently and are extremely sensitive to CPU-side micro-delays that hold up the dispatch of work to GPUs. Explore how multi-socket CPU systems with attached NVIDIA GPUs run concurrent processes, including data loading, preprocessing, and model checkpointing, that create scheduling congestion. Understand the implementation of the scx_layered scheduler and the methodology used to deploy it across large-scale infrastructure. Examine how latency-critical system tasks were identified and how resource isolation strategies were developed. Follow the approaches used to debug corner cases and to monitor performance across the whole fleet. Analyze the results achieved, including a 9% improvement in GPU compute unit utilization for certain model types and an overall reduction in fleet training costs.
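
To make the sensitivity to micro-delays concrete, here is a minimal sketch of why synchronized training amplifies small scheduling hiccups. It is not taken from the talk: the function names, the independence assumption between ranks, and the example numbers are all illustrative assumptions. The idea is only that a synchronized step finishes when its slowest rank finishes, so the chance that some rank is delayed grows quickly with the number of ranks.

```python
from typing import Sequence


def sync_step_time(rank_times_ms: Sequence[float]) -> float:
    """A synchronized step ends when the slowest rank ends (hypothetical model)."""
    return max(rank_times_ms)


def prob_step_delayed(num_ranks: int, p_delay: float) -> float:
    """Probability that at least one rank is delayed in a step.

    Assumes each rank is delayed independently with probability p_delay,
    which is a simplification made for illustration only.
    """
    return 1.0 - (1.0 - p_delay) ** num_ranks


if __name__ == "__main__":
    # Three ranks; one is held up by 2.5 ms before it can dispatch GPU work.
    print(sync_step_time([10.0, 10.0, 12.5]))

    # Illustrative per-rank delay probability of 0.1% per step.
    for ranks in (1, 8, 1000):
        print(ranks, round(prob_step_delayed(ranks, 0.001), 4))
```

Under these assumed numbers, a delay that is rare on any single rank becomes likely on every step once enough ranks must synchronize, which is one way to read the emphasis on isolating latency-critical tasks.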

Syllabus

Accelerating AI training fleets with sched_ext - Patrick Lu, Valentin Andrei, Pat Somaru

Taught by

Linux Plumbers Conference
