Stability in Large Model Training - Practices in Software and Hardware Fault Self-Healing

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Learn about maintaining stability in large-scale AI model training through comprehensive software and hardware fault self-healing practices in this 23-minute conference talk from CNCF. Discover how Ant Group addresses the critical challenge of keeping GPUs running at full speed when training trillion-parameter AI models, where hardware and software failures frequently disrupt operations and increase costs.

Explore real-world insights from LLaMA3's training experience, which faced 419 interruptions over 54 days, with 78% attributed to hardware issues, highlighting the urgent need for automated anomaly recovery systems.

Gain practical knowledge about implementing comprehensive GPU monitoring from the hardware level up to the application level; developing self-healing mechanisms for clusters of 10,000+ GPUs, including automated fault isolation and recovery from kernel panics; and establishing core Service Level Objectives that achieve over 98% GPU availability and more than 90% automatic fault isolation. Understand how predictive maintenance based on failure pattern analysis can significantly reduce downtime and improve overall system reliability in large-scale AI infrastructure deployments.
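The monitor → isolate → recover loop described above can be sketched in miniature. This is an illustrative toy, not Ant Group's actual system: the class names, the fault-reporting interface, and the availability metric are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GPUNode:
    """One GPU worker node in the training cluster."""
    node_id: str
    healthy: bool = True


class FaultIsolator:
    """Toy self-healing loop: detect unhealthy nodes, cordon them off,
    and report fleet availability (the metric behind a 98%-style SLO)."""

    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}
        self.isolated = set()

    def report_fault(self, node_id):
        # A monitor flags a hardware/software fault on this node
        # (e.g. an ECC error or a kernel panic).
        self.nodes[node_id].healthy = False

    def reconcile(self):
        # Cordon every unhealthy node so the scheduler avoids it;
        # training would then resume on the healthy remainder
        # from the last checkpoint.
        for nid, node in self.nodes.items():
            if not node.healthy:
                self.isolated.add(nid)
        return self.isolated

    def availability(self):
        # Fraction of the fleet still schedulable.
        usable = len(self.nodes) - len(self.isolated)
        return usable / len(self.nodes)


# A 100-node cluster loses one GPU; the loop isolates it automatically.
cluster = FaultIsolator([GPUNode(f"gpu-{i}") for i in range(100)])
cluster.report_fault("gpu-7")
cluster.reconcile()
print(f"availability = {cluster.availability():.2%}")  # availability = 99.00%
```

A production system would replace `report_fault` with real telemetry (driver errors, NVLink/network health, kernel logs) and pair isolation with automated job restart from checkpoint, which is where the talk's "self-healing" framing comes from.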

Syllabus

Stability in Large Model Training: Practices in Software and Hardware Fault Self-Healing - Yang Cao

Taught by

CNCF [Cloud Native Computing Foundation]
