Stability in Large Model Training - Practices in Software and Hardware Fault Self-Healing
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about maintaining stability in large-scale AI model training through comprehensive software and hardware fault self-healing practices in this 23-minute conference talk from CNCF. Discover how Ant Group addresses the critical challenge of keeping GPUs running at full speed when training trillion-parameter AI models, where hardware and software failures frequently disrupt operations and drive up costs. Explore real-world insights from Llama 3's training experience, which suffered 419 interruptions over 54 days, 78% of them attributed to hardware issues, underscoring the urgent need for automated anomaly recovery. Gain practical knowledge about implementing comprehensive GPU monitoring from the hardware level up to the application level, building self-healing mechanisms for clusters of 10,000+ GPUs, including automated fault isolation and recovery from kernel panics, and establishing core Service Level Objectives that achieve over 98% GPU availability and more than 90% automatic fault isolation. Understand how predictive maintenance based on failure pattern analysis can significantly reduce downtime and improve overall system reliability in large-scale AI infrastructure deployments.
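To make the monitoring-and-isolation idea concrete, here is a minimal sketch of a node-level GPU health check with automatic isolation. It is not Ant Group's system: the use of `nvidia-smi` for metrics, `kubectl cordon` for isolation, the thresholds, and the node name are all illustrative assumptions layered on the talk's description of automated fault isolation.

```python
#!/usr/bin/env python3
"""Sketch: poll GPU health on a node and cordon it if a GPU looks unhealthy.

Assumptions (not from the talk): metrics come from `nvidia-smi`, isolation is
done by cordoning the Kubernetes node with `kubectl`, and thresholds are
placeholders. A production system would track many more signals (ECC errors,
Xid events, NVLink status, kernel logs) and drive recovery workflows.
"""
import subprocess
import sys

TEMP_LIMIT_C = 90      # illustrative thermal threshold
UTIL_FLOOR_PCT = 5     # illustrative "GPU appears hung" floor during training


def read_gpu_metrics():
    """Query per-GPU temperature and utilization via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    metrics = []
    for line in out.strip().splitlines():
        idx, temp, util = (field.strip() for field in line.split(","))
        metrics.append({"index": int(idx), "temp_c": int(temp), "util_pct": int(util)})
    return metrics


def find_unhealthy(metrics):
    """Flag GPUs that are overheating or idle when they should be busy."""
    return [m for m in metrics
            if m["temp_c"] >= TEMP_LIMIT_C or m["util_pct"] <= UTIL_FLOOR_PCT]


def isolate_node(node_name):
    """Cordon the node so the scheduler stops placing training pods on it."""
    subprocess.run(["kubectl", "cordon", node_name], check=True)


if __name__ == "__main__":
    node = sys.argv[1] if len(sys.argv) > 1 else "gpu-node-0"  # hypothetical node name
    bad = find_unhealthy(read_gpu_metrics())
    if bad:
        print(f"unhealthy GPUs on {node}: {bad}")
        isolate_node(node)
```

Run periodically (for example from a DaemonSet or cron job) such a check approximates the "detect, isolate, recover" loop the talk describes; the SLOs mentioned above (98% GPU availability, 90%+ automatic isolation) would then be measured over the events this kind of loop records.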
Syllabus
Stability in Large Model Training: Practices in Software and Hardware Fault Self-Healing - Yang Cao
Taught by
CNCF [Cloud Native Computing Foundation]