Holmes - Localizing Irregularities in LLM Training with Mega-scale GPU Clusters

Learn about a system for detecting and localizing performance irregularities in large-scale GPU clusters during LLM training in this 16-minute conference presentation from NSDI '25. Discover how researchers from Fudan University, Tencent, and the University of Chicago developed Holmes, a first-of-its-kind system that addresses a critical but often overlooked problem: irregular training iterations that can take more than twice as long as a normal iteration and that extend overall training time even more than traditional failures do. Explore the system's core components, an enhanced abnormal operator detection model and a novel communication operator graph, which together enable real-time irregularity localization with 97.21% accuracy. Understand how Holmes leverages communication operators and cross-iteration analysis to identify performance bottlenecks in mega-scale GPU environments, localizing irregularities within 30.3 seconds, a 6.52× speedup over traditional approaches. Gain insights from large-scale measurements across tens of thousands of GPUs that revealed the silent nature of these irregularities and their substantial impact on LLM training efficiency, and review evaluation results from both trace-driven simulations and production-level prototype testing.