

SAVE - Software-Implemented Fault Tolerance for Model Inference against GPU Memory Bit Flips

USENIX via YouTube

Overview

Learn about SAVE, a software-implemented fault tolerance system designed to protect machine learning model inference from GPU memory bit flips in safety-critical applications. Discover how researchers from Shanghai Jiao Tong University address the challenge of maintaining model accuracy when hardware faults occur in autonomous driving, industrial robots, and satellite systems. Explore the key insight that not all hardware bits have an equal impact on model inference, and understand how SAVE leverages the small amount of reliable memory available in modern AI accelerators. Examine the four-stage approach that gives the system its name: Selection identifies vulnerable bits based on model inference characteristics, Allocation prioritizes vulnerable computations for placement in reliable memory, Verification detects errors efficiently through asynchronous CPU checks, and Edit performs fault recovery. Review evaluation results showing that SAVE maintains model accuracy even when subjected to 4,000 bit flips while introducing less than 9% performance overhead across computer vision, robotics, and decision-making models.
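The key insight above, that bits differ widely in how much they matter, can be illustrated with a minimal Python sketch. This is not SAVE's implementation: it assumes a weight stored as an IEEE 754 float32, and the helper names `flip_bit` and `rank_bits_by_impact` are hypothetical, introduced only for this illustration. It flips each bit of a single weight and ranks the bit positions by how far the value moves.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as float32, with one bit flipped.

    Bit 0 is the lowest mantissa bit, bits 23-30 are the exponent,
    and bit 31 is the sign. (Hypothetical helper for illustration.)
    """
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def rank_bits_by_impact(value: float) -> list[int]:
    """Rank the 32 bit positions from most to least damaging when flipped.

    Impact is the absolute change in the stored value; a flip that yields
    inf or NaN is treated as infinitely damaging. This is an assumed,
    simplified metric, not the selection criterion used by SAVE.
    """
    def impact(bit: int) -> float:
        flipped = flip_bit(value, bit)
        return abs(flipped - value) if math.isfinite(flipped) else math.inf

    return sorted(range(32), key=impact, reverse=True)


weight = 0.5
print(flip_bit(weight, 0))    # lowest mantissa bit: value barely moves
print(flip_bit(weight, 30))   # highest exponent bit: value becomes enormous
print(rank_bits_by_impact(weight)[:4])  # the few bits most worth protecting
```

Under these assumptions, a flip in the top exponent bit turns a weight of 0.5 into a number on the order of 10^38, while a flip in the lowest mantissa bit changes it by less than one part in a million. That gap is why a small budget of reliable memory, spent on the most damaging bits, can plausibly cover most of the risk.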

Syllabus

USENIX ATC '25 - SAVE: Software-Implemented Fault Tolerance for Model Inference against GPU Memory..

Taught by

USENIX

