Overview
Learn about fail-slow hardware failure bugs in cloud systems through this 14-minute conference presentation from USENIX ATC '25. Explore a comprehensive bug study analyzing 48 real-world fail-slow hardware failures from typical cloud systems, in which hardware components keep running but operate in a degraded mode, causing severe system failures. Discover how fail-slow hardware makes high-level software components vulnerable, particularly synchronization and timeout mechanisms, and understand why fail-slow behavior must be modeled at fine granularity to trigger these bugs. Examine Sieve, a novel fault injection testing framework designed specifically for detecting fail-slow hardware failure bugs, which statically analyzes the target system's code to identify synchronized and timeout-protected I/O operations as candidate fault points. Review the framework's grouping and context-sensitive injection strategies for efficiently exploring candidate fault points, and learn about its successful application to three widely deployed cloud systems (ZooKeeper, Kafka, and HDFS), which uncovered six previously unknown bugs, two of them confirmed by maintainers.
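To make the idea of a candidate fault point concrete, here is a minimal Python sketch of a delay injected into a timeout-protected I/O operation. It is not Sieve's implementation, which the description says relies on static analysis of the target system; every name here (`slow_io`, `read_block`, `read_with_timeout`) is hypothetical and chosen only for illustration.

```python
import concurrent.futures
import time


def slow_io(read_fn, delay_s):
    """Wrap an I/O function so each call is delayed, imitating fail-slow hardware."""
    def wrapped(*args, **kwargs):
        time.sleep(delay_s)  # injected delay: the device still works, just slowly
        return read_fn(*args, **kwargs)
    return wrapped


def read_block():
    """Hypothetical stand-in for a disk or network read."""
    return b"data"


def read_with_timeout(read_fn, timeout_s):
    """Timeout-protected I/O: return the data, or None if the read misses the deadline."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            # The read did not fail outright; it was merely too slow.
            # (Leaving the `with` block still waits for the slow read to finish.)
            return None


healthy = read_with_timeout(read_block, timeout_s=0.5)
degraded = read_with_timeout(slow_io(read_block, delay_s=0.3), timeout_s=0.05)
```

In this sketch the healthy read returns its data while the slowed read trips the timeout path, even though the underlying operation never raised an error. That timeout path is the kind of code that a fail-slow fault would exercise.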
Syllabus
USENIX ATC '25 - Understanding and Detecting Fail-Slow Hardware Failure Bugs in Cloud Systems
Taught by
USENIX