Challenges in Implementing AI/ML Training Job Recovery from GPU/Accelerator Data Poisoning Events
Open Compute Project via YouTube
Overview
In this 21-minute conference talk, Anil Agrawal (Hardware Systems Engineer) and David Xiao (Engineering Manager) from Meta address the growing challenge of AI/ML training job interruptions caused by hardware faults, particularly uncorrected GPU/accelerator memory errors in large-scale clusters. The talk covers how these errors affect training job interruption rates in Meta's large training clusters built on the Grand Teton Training Platform, and describes the RAS (Reliability, Availability, Serviceability) technology implemented to mitigate them. The presenters share their experiences and conclude with a call to action for the Open Compute Project community, proposing several recovery techniques to reduce interruptions caused by uncorrected hardware errors in computing environments exceeding 100,000 nodes.
Syllabus
Challenges in implementing AI/ML training job recovery from GPU/Accelerator data poisoning events
Taught by
Open Compute Project