Learn about the development and deployment of Aegis, a fault diagnosis system built for AI model training services in production cloud environments, in this 16-minute conference presentation from NSDI '25. Discover why diagnosis systems designed for traditional cloud computing fall short for AI model training, where the computing paradigm is fundamentally different. Explore the two-phase evolution of Aegis: Phase-1 enhanced existing general-purpose diagnosis systems, with easy deployment as a core principle. Phase-2 customized the collective communication library to localize failures at runtime without requiring any changes to customer code. Examine additional capabilities integrated into Aegis, including performance degradation handling and pre-delivery failure checking. Gain insights from production deployment results in Alibaba Cloud's training service: a 97% reduction in idle time wasted on diagnosis, an 84% decrease in training task restarts, and a 71% reduction in performance degradation.