One-Size-Fits-None - Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems

USENIX via YouTube

Overview

Learn about slow-fault tolerance challenges in modern distributed systems through this 17-minute conference talk from NSDI '25. Explore comprehensive research investigating how distributed software handles fail-slow behavior in hardware components, which has become increasingly problematic at scale. Discover the nuanced characteristics of slow faults and understand why current static threshold-based handling mechanisms prove inadequate for their highly sensitive and dynamic nature. Examine a systematic testing pipeline designed to introduce diverse slow faults, measure their impact across different workloads, and identify behavioral patterns that reveal how even minor changes can trigger dramatically different system reactions. Get introduced to ADR (Adaptive Distributed Resilience), a lightweight library solution that enables adaptive fail-slow handling within system code, and review evaluation results demonstrating its effectiveness in significantly reducing slow fault impact on distributed system performance.
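
To make the contrast between static and adaptive handling concrete, here is a minimal Python sketch. It is not ADR's API or the method presented in the talk; the names `is_slow_static` and `AdaptiveSlowDetector`, the 1000 ms threshold, and the median-times-factor rule are all assumptions chosen purely for illustration.

```python
from collections import deque
from statistics import median

# Hypothetical fixed threshold, standing in for a hard-coded timeout.
STATIC_THRESHOLD_MS = 1000.0


def is_slow_static(latency_ms, threshold_ms=STATIC_THRESHOLD_MS):
    """Static check: flag a request only if it exceeds a fixed threshold."""
    return latency_ms > threshold_ms


class AdaptiveSlowDetector:
    """Illustrative adaptive check: compare each latency to recent history.

    An assumption-laden sketch, not the ADR library's actual design.
    """

    def __init__(self, window=50, factor=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # sliding window of healthy latencies
        self.factor = factor                 # how far above the median counts as slow
        self.min_samples = min_samples       # wait for a baseline before judging

    def observe(self, latency_ms):
        """Record a latency and return True if it looks slow versus the baseline."""
        slow = False
        if len(self.samples) >= self.min_samples:
            slow = latency_ms > self.factor * median(self.samples)
        # Keep slow samples out of the baseline so a sustained slow fault
        # does not gradually become the new "normal".
        if not slow:
            self.samples.append(latency_ms)
        return slow


detector = AdaptiveSlowDetector()
for _ in range(20):
    detector.observe(10.0)  # build a baseline of roughly 10 ms requests

# A 50 ms request is five times the recent median but far below the
# fixed threshold, so only the adaptive check flags it.
static_flag = is_slow_static(50.0)
adaptive_flag = detector.observe(50.0)
```

In this sketch a fivefold slowdown passes the fixed threshold unnoticed while the adaptive check flags it, which is the kind of gap the talk attributes to static, one-size-fits-all settings.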

Syllabus

NSDI '25 - One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern...

Taught by

USENIX
