Optimizing Training Performance for Large Language Models in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores the challenges and solutions for optimizing Large Language Model (LLM) training performance in Kubernetes at scale. Learn how to achieve optimal performance and linearity for massive training jobs involving up to 100,000 GPUs, discover the three most critical factors affecting performance, and follow a step-by-step optimization approach. The speakers present an end-to-end analysis of bottlenecks in LLM training at scale, showing how insufficient resource management and the lack of network topology awareness in Kubernetes degrade performance. They then introduce new resource management models, LLM-dedicated training workloads, and scheduling solutions developed in the Volcano open source community that address these bottlenecks for large-scale AI training.
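To make the scheduling ideas above concrete, here is a minimal sketch of a Volcano batch Job that gang-schedules a multi-pod GPU training workload; the image name and replica/GPU counts are illustrative placeholders, not values from the talk:

```yaml
# Minimal Volcano Job sketch: gang scheduling for distributed LLM training.
# All names and counts below are hypothetical examples.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-train
spec:
  schedulerName: volcano
  minAvailable: 8          # gang scheduling: start only when all 8 pods can run
  plugins:
    ssh: []                # inject SSH keys so workers can reach each other
    svc: []                # create a headless service for pod discovery
  tasks:
    - name: worker
      replicas: 8
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: example.com/llm-trainer:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 8                   # GPUs per worker pod
```

The key point is `minAvailable`: it prevents the partial-placement deadlocks that plague large distributed training jobs under the default Kubernetes scheduler, which places pods one at a time.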
Syllabus
Optimizing Training Performance for Large Language Models (LLM) in Kubernetes - Klaus Ma & Peng Gu
Taught by
CNCF [Cloud Native Computing Foundation]