Operationalizing High-Performance GPU Clusters in Kubernetes - Lessons Learned from Training Databricks DBRX
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore a technical conference talk on the challenges and solutions of operating high-performance GPU clusters in Kubernetes, drawn from the training of Databricks DBRX. Learn how to manage a 400-node cluster with 3072 GPUs, implement GPU health monitoring with Prometheus and DCGM Exporter, and handle GPUDirect RDMA (GDRDMA) monitoring. Discover practical approaches to failure scenarios that arise during large language model training, and understand the engineering considerations of working with GPU clusters across multiple cloud providers. Gain insight into keeping the node fleet and the interconnect fabric healthy while training state-of-the-art LLMs at scale.
Syllabus
Operationalizing High-Performance GPU Clusters in Kubernetes: Lessons Learne... Will Gleich & Wai Wu
Taught by
CNCF [Cloud Native Computing Foundation]