Overview
Learn about SAVE, a software-implemented fault tolerance system designed to protect machine learning model inference from GPU memory bit flips in safety-critical applications. Discover how researchers from Shanghai Jiao Tong University address the challenge of maintaining model accuracy when hardware failures occur in autonomous driving, industrial robots, and satellite systems. Explore the key insight that not all hardware bits have equal impact on model inference, and understand how SAVE leverages small but reliable memory available in modern AI accelerators. Examine the four-stage approach: Selection for identifying vulnerable bits based on model inference characteristics, Allocation for prioritizing vulnerable computations in reliable memory, Verification through asynchronous CPU checks for efficient error detection, and Edit for fault recovery. Review evaluation results demonstrating how SAVE maintains model accuracy even under 4,000 bit flips while introducing less than 9% performance overhead across computer vision, robotics, and decision-making models.
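The insight that not all bits have equal impact can be illustrated with a small sketch. This is not SAVE's Selection procedure, only a hedged demonstration assuming model weights are stored as IEEE 754 float32 values; the helper name `flip_bit` and the sample weight are hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of value's float32 encoding."""
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR toggles the chosen bit; then reinterpret back as float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5                     # hypothetical model weight
low = flip_bit(weight, 0)        # flip the lowest mantissa bit
high = flip_bit(weight, 30)      # flip the highest exponent bit

print(f"original:            {weight}")
print(f"mantissa bit 0 flip: {low!r}")
print(f"exponent bit 30 flip: {high!r}")
```

A flip in the low mantissa bit changes the weight by a negligible amount, while a flip in the top exponent bit turns it into an enormous value that could plausibly corrupt an inference result. That asymmetry is one reason a small amount of reliable memory, spent on the most vulnerable bits, may go a long way.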
Syllabus
USENIX ATC '25 - SAVE: Software-Implemented Fault Tolerance for Model Inference against GPU Memory..
Taught by
USENIX