Overview
Learn about an industrial-scale optimization system for accelerating deep learning model training on Huawei Ascend chips in this 13-minute conference presentation from USENIX ATC '25. Researchers from Nanjing University, Peng Cheng Laboratory, Huawei, Shandong University, and Peking University developed Hermes, a system that addresses the challenge of optimizing training efficiency for large-scale deep learning models.

Hermes has three core components:

- a lightweight profiling approach that captures sporadic performance fluctuations during extended training sessions;
- a hierarchical bottleneck analysis framework that comprehensively and accurately identifies performance issues among the many influencing factors;
- an optimization advisor that guides the selection of effective optimization strategies.

The talk also examines real-world experimental results demonstrating significant performance improvements: a 3.05× speedup for PanGu-α, a 1.91× acceleration for MobileNetV1, and a 1.19× improvement for Mixture of Experts (MoE) models, all drawn from three years of practical experience with 135 typical optimization cases on the Ascend hardware architecture.
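To give a flavor of what "capturing sporadic performance fluctuations" might look like in practice, here is a minimal, hypothetical sketch (not Hermes's actual implementation, whose details the talk covers): it flags training steps whose duration spikes well above a rolling median of recent steps, which keeps profiling overhead low over long training sessions.

```python
import statistics

def detect_fluctuations(step_times, window=5, threshold=1.5):
    """Flag steps whose duration exceeds `threshold` times the
    rolling median of the preceding `window` steps.

    Returns a list of (step_index, duration, baseline) tuples.
    """
    anomalies = []
    for i in range(window, len(step_times)):
        baseline = statistics.median(step_times[i - window:i])
        if step_times[i] > threshold * baseline:
            anomalies.append((i, step_times[i], baseline))
    return anomalies

# Synthetic per-step durations (seconds): steady ~1.0 s with one spike.
times = [1.0, 1.02, 0.98, 1.01, 0.99, 1.0, 2.7, 1.01, 1.0, 0.97]
print(detect_fluctuations(times))  # → [(6, 2.7, 1.0)]
```

A rolling median (rather than a mean) is used here so a single earlier spike does not inflate the baseline and mask later anomalies.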
Syllabus
USENIX ATC '25 - Accelerating Model Training on Ascend Chips: An Industrial System for Profiling...
Taught by
USENIX