Overview
Learn about an industrial-scale optimization system for accelerating deep learning model training on Huawei Ascend chips in this 13-minute conference presentation from USENIX ATC '25. Researchers from Nanjing University, Peng Cheng Laboratory, Huawei, Shandong University, and Peking University developed Hermes, a system that addresses the challenge of optimizing training efficiency for large-scale deep learning models.

Hermes has three core components:

- a lightweight profiling approach that captures sporadic performance fluctuations during extended training sessions;
- a hierarchical bottleneck analysis framework that comprehensively and accurately identifies performance issues among the many influencing factors;
- an optimization advisor that guides the selection of effective optimization strategies.

The talk also examines real-world experimental results demonstrating significant performance improvements: a 3.05× speedup for PanGu-α, a 1.91× acceleration for MobileNetV1, and a 1.19× improvement for Mixture of Experts (MoE) models, all drawn from three years of practical experience with 135 typical optimization cases on the Ascend hardware architecture.
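To give a flavor of what "capturing sporadic performance fluctuations" might look like in practice, here is a minimal, hypothetical sketch (not Hermes's actual implementation, whose details the talk covers): it flags training steps whose duration spikes well above a rolling median of recent steps, which keeps profiling overhead low over long training sessions.

```python
import statistics

def detect_fluctuations(step_times, window=5, threshold=1.5):
    """Flag steps whose duration exceeds `threshold` times the
    rolling median of the preceding `window` steps.

    Returns a list of (step_index, duration, baseline) tuples.
    """
    anomalies = []
    for i in range(window, len(step_times)):
        baseline = statistics.median(step_times[i - window:i])
        if step_times[i] > threshold * baseline:
            anomalies.append((i, step_times[i], baseline))
    return anomalies

# Synthetic per-step durations (seconds): steady ~1.0 s with one spike.
times = [1.0, 1.02, 0.98, 1.01, 0.99, 1.0, 2.7, 1.01, 1.0, 0.97]
print(detect_fluctuations(times))  # → [(6, 2.7, 1.0)]
```

A rolling median (rather than a mean) is used here so a single earlier spike does not inflate the baseline and mask later anomalies.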
Syllabus
USENIX ATC '25 - Accelerating Model Training on Ascend Chips: An Industrial System for Profiling...
Taught by
USENIX