Challenges in Implementing AI/ML Training Job Recovery from GPU/Accelerator Data Poisoning Events
Open Compute Project via YouTube
Overview
This 21-minute conference talk by Anil Agrawal (Hardware Systems Engineer) and David Xiao (Engineering Manager) from Meta addresses the growing challenge of AI/ML training job interruptions caused by hardware faults, particularly uncorrected GPU/accelerator memory errors in large-scale clusters. The talk explores how these errors drive interruption rates in Meta's large training clusters built on the Grand Teton Training Platform, and covers the RAS (Reliability, Availability, Serviceability) techniques implemented to mitigate them. The presenters share their experiences and close with a call to action for the Open Compute Project community, proposing recovery techniques to reduce interruptions caused by hardware uncorrected errors in massive computing environments exceeding 100,000 nodes.
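The talk itself does not publish code, but the class of recovery technique it discusses is commonly realized as checkpoint-restart: on an uncorrected memory error, roll the job back to the last known-good checkpoint rather than failing it outright. Below is a minimal, hypothetical sketch of that idea; all names (`UncorrectedMemoryError`, `Trainer`, the toy loss update) are illustrative assumptions, not Meta's implementation.

```python
class UncorrectedMemoryError(RuntimeError):
    """Hypothetical stand-in for a GPU/accelerator uncorrected-error signal."""


class Trainer:
    def __init__(self, total_steps, checkpoint_every):
        self.total_steps = total_steps
        self.checkpoint_every = checkpoint_every
        self.state = {"step": 0, "loss": 1.0}   # toy training state
        self.last_checkpoint = dict(self.state)  # last known-good snapshot

    def train_step(self, fail_at=None):
        # Simulate a data-poisoning event at a chosen step (illustrative only).
        if fail_at is not None and self.state["step"] == fail_at:
            raise UncorrectedMemoryError("uncorrected memory error: data poisoned")
        self.state["step"] += 1
        self.state["loss"] *= 0.99  # pretend the model improves

    def run(self, fail_at=None):
        restarts = 0
        while self.state["step"] < self.total_steps:
            try:
                self.train_step(fail_at)
                if self.state["step"] % self.checkpoint_every == 0:
                    self.last_checkpoint = dict(self.state)
            except UncorrectedMemoryError:
                # Roll back to the last good checkpoint instead of failing the job.
                self.state = dict(self.last_checkpoint)
                restarts += 1
                fail_at = None  # assume the faulty node is drained/replaced
        return restarts
```

In this sketch, a fault injected at step 55 with checkpoints every 10 steps costs only the 5 steps since the last snapshot, which is the trade-off (checkpoint frequency versus lost work) that recovery schemes for 100,000-node clusters must tune.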
Syllabus
Challenges in implementing AI/ML training job recovery from GPU/Accelerator data poisoning events
Taught by
Open Compute Project