Challenges in Implementing AI/ML Training Job Recovery from GPU/Accelerator Data Poisoning Events

This 21-minute conference talk by Anil Agrawal (Hardware Systems Engineer) and David Xiao (Engineering Manager) of Meta addresses a growing problem: AI/ML training jobs interrupted by hardware faults, particularly uncorrected memory errors on GPUs and accelerators in large-scale clusters. Explore how these errors affect training job interruption rates in Meta's large training clusters built on the Grand Teton Training Platform, and learn about the RAS (Reliability, Availability, Serviceability) technology implemented to mitigate them. The presenters share their experiences and conclude with a call to action for the Open Compute Project community, proposing several recovery techniques to reduce interruptions caused by uncorrected hardware errors in massive computing environments exceeding 100,000 nodes.