SAVE - Software-Implemented Fault Tolerance for Model Inference against GPU Memory Bit Flips

Learn about SAVE, a software-implemented fault tolerance system designed to protect machine learning model inference from GPU memory bit flips in safety-critical applications. Discover how researchers from Shanghai Jiao Tong University address the challenge of maintaining model accuracy when hardware failures occur in autonomous driving, industrial robots, and satellite systems. Explore the key insight that not all hardware bits have equal impact on model inference, and understand how SAVE leverages small but reliable memory available in modern AI accelerators. Examine the four-stage approach: Selection for identifying vulnerable bits based on model inference characteristics, Allocation for prioritizing vulnerable computations in reliable memory, Verification through asynchronous CPU checks for efficient error detection, and Edit for fault recovery. Review evaluation results demonstrating how SAVE maintains model accuracy even under 4,000 bit flips while introducing less than 9% performance overhead across computer vision, robotics, and decision-making models.
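The insight that not all bits have equal impact can be illustrated with a minimal sketch. It is not SAVE's Selection stage, whose details the description above does not give. It only assumes that a model weight is stored as an IEEE 754 float32, and the helper name `flip_bit` is hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE 754 float32, with one bit flipped.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign. Hypothetical helper for illustration.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 32)")
    # Reinterpret the float32 as a 32-bit unsigned integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the chosen bit and reinterpret the result as a float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # an assumed example weight, exactly representable in float32

low = flip_bit(weight, 0)    # lowest mantissa bit: a tiny perturbation
sign = flip_bit(weight, 31)  # sign bit: same magnitude, opposite sign
high = flip_bit(weight, 30)  # highest exponent bit: a huge magnitude change

for name, value in (("mantissa bit 0", low), ("sign bit 31", sign), ("exponent bit 30", high)):
    print(f"{name}: {weight} -> {value!r}")
```

A flip in the lowest mantissa bit changes the weight by less than one part in a million, while a flip in the highest exponent bit turns 0.5 into a value above 1e38. This kind of gap is why it can pay to keep only the most sensitive bits in a small amount of reliable memory.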