Evolution of Aegis - Fault Diagnosis for AI Model Training Service in Production

USENIX via YouTube

Overview

Learn about the development and deployment of Aegis, a fault diagnosis system built for AI model training services in production cloud environments, in this 16-minute conference presentation from NSDI '25. Discover why diagnosis systems designed for traditional cloud computing fall short when applied to AI model training, whose computing paradigm differs fundamentally. Explore the two-phase evolution of Aegis: Phase-1 enhances existing general-purpose diagnosis systems while keeping easy deployment as a core principle, and Phase-2 customizes collective communication libraries to localize failures at runtime without requiring modifications to customer code. Examine additional capabilities integrated into Aegis, including performance degradation handling and pre-delivery failure checking. See the results of production deployment across Alibaba Cloud's training service infrastructure: a 97% reduction in idle time wasted during diagnosis, an 84% decrease in training task restart counts, and a 71% reduction in performance degradation.

Syllabus

NSDI '25 - Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production

Taught by

USENIX
