Optimizing Training Performance for Large Language Models in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores the challenges and solutions involved in optimizing Large Language Model (LLM) training performance in Kubernetes at scale, where a single training job can involve up to 100,000 GPUs. The speakers identify the three most critical factors affecting performance and walk through a step-by-step optimization approach, presenting an end-to-end analysis of the bottlenecks in large-scale LLM training and demonstrating how insufficient resource management and the lack of network-topology awareness in Kubernetes degrade performance. They then introduce the new resource management models, LLM-dedicated training workloads, and scheduling solutions developed in the Volcano open source community that help large-scale AI training achieve optimal performance and linearity.
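The talk itself focuses on scheduler and resource-management design rather than code, but as a rough illustration of the kind of Volcano scheduling the overview refers to, the sketch below submits a gang-scheduled distributed training job through Volcano's Job CRD using the official kubernetes Python client. The job name, container image, and GPU counts are illustrative assumptions, not details from the talk.

# A minimal sketch, assuming a cluster with the Volcano scheduler installed
# and the `kubernetes` Python package available (pip install kubernetes).
from kubernetes import client, config

def submit_llm_training_job():
    # Load credentials from the local kubeconfig.
    config.load_kube_config()

    # A Volcano Job (batch.volcano.sh/v1alpha1) wrapping a distributed trainer.
    volcano_job = {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": "llm-pretrain", "namespace": "default"},  # hypothetical name
        "spec": {
            # Gang scheduling: no pod starts until all 8 worker pods can be
            # placed, preventing half-scheduled jobs from holding GPUs idle.
            "minAvailable": 8,
            "schedulerName": "volcano",
            "tasks": [
                {
                    "name": "worker",
                    "replicas": 8,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": "trainer",
                                    # Hypothetical image; replace with a real trainer.
                                    "image": "registry.example.com/llm-trainer:latest",
                                    "resources": {"limits": {"nvidia.com/gpu": 8}},
                                }
                            ],
                        }
                    },
                }
            ],
        },
    }

    # Volcano Jobs are CRDs, so they are created via the custom-objects API.
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="batch.volcano.sh",
        version="v1alpha1",
        namespace="default",
        plural="jobs",
        body=volcano_job,
    )

if __name__ == "__main__":
    submit_llm_training_job()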
Syllabus
Optimizing Training Performance for Large Language Model (LLM) in Kubernetes - Klaus Ma & Peng Gu
Taught by
CNCF [Cloud Native Computing Foundation]