Optimizing Training Performance for Large Language Models in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores the challenges and solutions involved in optimizing Large Language Model (LLM) training performance in Kubernetes at scale, where a single training job can involve up to 100,000 GPUs. The speakers identify the three most critical factors affecting performance and walk through a step-by-step optimization approach, presenting an end-to-end analysis of the bottlenecks in large-scale LLM training and demonstrating how insufficient resource management and the lack of network-topology awareness in Kubernetes degrade performance. They then introduce the new resource management models, LLM-dedicated training workloads, and scheduling solutions developed in the Volcano open source community that help large-scale AI training achieve optimal performance and linearity.
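The talk itself focuses on scheduler and resource-management design rather than code, but as a rough illustration of the kind of Volcano scheduling the overview refers to, the sketch below submits a gang-scheduled distributed training job through Volcano's Job CRD using the official kubernetes Python client. The job name, container image, and GPU counts are illustrative assumptions, not details from the talk.

# A minimal sketch, assuming a cluster with the Volcano scheduler installed
# and the `kubernetes` Python package available (pip install kubernetes).
from kubernetes import client, config

def submit_llm_training_job():
    # Load credentials from the local kubeconfig.
    config.load_kube_config()

    # A Volcano Job (batch.volcano.sh/v1alpha1) wrapping a distributed trainer.
    volcano_job = {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": "llm-pretrain", "namespace": "default"},  # hypothetical name
        "spec": {
            # Gang scheduling: no pod starts until all 8 worker pods can be
            # placed, preventing half-scheduled jobs from holding GPUs idle.
            "minAvailable": 8,
            "schedulerName": "volcano",
            "tasks": [
                {
                    "name": "worker",
                    "replicas": 8,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": "trainer",
                                    # Hypothetical image; replace with a real trainer.
                                    "image": "registry.example.com/llm-trainer:latest",
                                    "resources": {"limits": {"nvidia.com/gpu": 8}},
                                }
                            ],
                        }
                    },
                }
            ],
        },
    }

    # Volcano Jobs are CRDs, so they are created via the custom-objects API.
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="batch.volcano.sh",
        version="v1alpha1",
        namespace="default",
        plural="jobs",
        body=volcano_job,
    )

if __name__ == "__main__":
    submit_llm_training_job()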
Syllabus
Optimizing Training Performance for Large Language Model (LLM) in Kubernetes - Klaus Ma & Peng Gu
Taught by
CNCF [Cloud Native Computing Foundation]