Challenges in Implementing AI/ML Training Job Recovery from GPU/Accelerator Data Poisoning Events
Open Compute Project via YouTube
Overview
This 21-minute conference talk by Anil Agrawal (Hardware Systems Engineer) and David Xiao (Engineering Manager) of Meta addresses a growing problem: AI/ML training jobs interrupted by hardware faults, particularly uncorrected GPU/accelerator memory errors in large-scale clusters. The talk covers how these errors affect job interruption rates in Meta's large training clusters built on the Grand Teton Training Platform, and describes the RAS (Reliability, Availability, Serviceability) technology implemented to mitigate them. The presenters share their experiences and close with a call to action for the Open Compute Project community, proposing several recovery techniques to reduce interruptions from uncorrected hardware errors in computing environments exceeding 100,000 nodes.
Syllabus
Challenges in implementing AI/ML training job recovery from GPU/Accelerator data poisoning events
Taught by
Open Compute Project