Understanding and Detecting Fail-Slow Hardware Failure Bugs in Cloud Systems

USENIX via YouTube

Overview

Learn about fail-slow hardware failure bugs in cloud systems through this 14-minute conference presentation from USENIX ATC '25. Explore a comprehensive bug study analyzing 48 real-world fail-slow hardware failures from typical cloud systems, in which hardware components keep running but operate in a degraded mode, causing severe system failures. Discover how fail-slow hardware makes high-level software components vulnerable, particularly synchronization and timeout mechanisms, and understand why fail-slow hardware must be considered at a fine granularity to trigger these bugs. Examine Sieve, a fault injection testing framework designed specifically for detecting fail-slow hardware failure bugs, which statically analyzes the target system's code to identify synchronized and timeout-protected I/O operations as candidate fault points. Review the framework's grouping and context-sensitive injection strategies for efficiently exploring candidate fault points, and learn about its application to three widely deployed cloud systems (ZooKeeper, Kafka, and HDFS), which uncovered six previously unknown bugs, two of which have been confirmed by maintainers.
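To make the two vulnerable patterns concrete, here is a minimal Python sketch of a delay-style fault injected into a timeout-protected I/O call and into an I/O call made while holding a lock. It is an illustration only, not Sieve's implementation: the function names (`slow_io`, `call_with_timeout`, `lock_wait_under_slow_io`) and the delay values are hypothetical, and `time.sleep` stands in for a slow disk or network operation.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_io(delay_s):
    """Hypothetical I/O call; the sleep models fail-slow hardware."""
    time.sleep(delay_s)  # the operation still succeeds, only late
    return "ok"


def call_with_timeout(fn, timeout_s, *args):
    """Run fn in a worker thread; return its result or "timeout"."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return "timeout"
    finally:
        pool.shutdown(wait=False)  # do not block on the slow worker


def lock_wait_under_slow_io(delay_s):
    """Return how long a second thread waits for a lock held across slow I/O."""
    lock = threading.Lock()
    started = threading.Event()

    def writer():
        with lock:  # comparable to a Java synchronized block
            started.set()
            slow_io(delay_s)

    worker = threading.Thread(target=writer)
    worker.start()
    started.wait()  # make sure the writer already holds the lock
    begin = time.monotonic()
    with lock:  # e.g. a request handler that needs the same lock
        waited = time.monotonic() - begin
    worker.join()
    return waited


if __name__ == "__main__":
    print(call_with_timeout(slow_io, 0.5, 0.0))   # healthy device
    print(call_with_timeout(slow_io, 0.05, 0.3))  # injected delay exceeds timeout
    print(round(lock_wait_under_slow_io(0.2), 2))  # seconds blocked on the lock
```

In this sketch the device never fails outright; it only answers late, which is enough to trip the timeout path or stall every thread that contends for the same lock. A fault injection tool in the spirit described above would choose such call sites automatically rather than by hand.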

Syllabus

USENIX ATC '25 - Understanding and Detecting Fail-Slow Hardware Failure Bugs in Cloud Systems

Taught by

USENIX
