Operationalizing High-Performance GPU Clusters in Kubernetes - Lessons Learned from Training Databricks DBRX

Explore a technical conference talk on the challenges of operating high-performance GPU clusters in Kubernetes, and the solutions found while training Databricks DBRX. Learn how to manage a 400-node cluster with 3072 GPUs, monitor GPU health with Prometheus and DCGM Exporter, and set up monitoring for GPUDirect RDMA (GDRDMA). Discover practical approaches to the failure scenarios that arise during large language model training, and understand the engineering considerations involved in running GPU clusters across multiple cloud providers. Gain working knowledge of how to keep a node fleet and its interconnect fabric healthy while training state-of-the-art LLMs at scale.