
Optimizing Training Performance for Large Language Models in Kubernetes

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk explores the challenges and solutions involved in optimizing Large Language Model (LLM) training performance in Kubernetes at scale, covering training jobs that span up to 100,000 GPUs. The speakers identify the three most critical factors affecting performance, walk through a step-by-step optimization approach, and present an end-to-end analysis of the bottlenecks in large-scale LLM training, showing how insufficient resource management and a lack of network topology awareness in Kubernetes degrade performance. They then introduce the new resource management models, LLM-dedicated training workloads, and scheduling solutions developed in the Volcano open source community that help large-scale AI training achieve optimal performance and near-linear scaling.
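To make the scheduling side concrete, below is a minimal sketch of a gang-scheduled Volcano training job of the kind the talk discusses. The apiVersion, schedulerName, minAvailable, plugins, and tasks fields follow the standard Volcano Job API; the networkTopology stanza is an assumption based on Volcano's network-topology-aware scheduling feature (exact field names vary by release), and the queue, image, and replica counts are placeholders.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-pretrain                # placeholder job name
spec:
  schedulerName: volcano            # hand placement to the Volcano scheduler
  queue: training                   # placeholder queue name
  minAvailable: 16                  # gang scheduling: all 16 workers start together or not at all
  plugins:
    ssh: []                         # inject SSH keys so workers can reach each other
    svc: []                         # headless service for worker discovery
  networkTopology:                  # assumption: topology-aware placement (field names may differ by release)
    mode: hard                      # never place pods outside the allowed network tier
    highestTierAllowed: 2           # keep all workers within two network tiers
  tasks:
    - name: worker
      replicas: 16
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: example.com/llm-trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8                   # 8 GPUs per worker; 128 GPUs total

The intent of the topology constraint is to keep collective-communication traffic (e.g. all-reduce) within the fastest network tier, since cross-switch traffic is typically what breaks scaling linearity long before GPU compute does.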

Syllabus

Optimizing Training Performance for Large Language Model (LLM) in Kubernetes - Klaus Ma & Peng Gu

Taught by

CNCF [Cloud Native Computing Foundation]

