Why Your 'Reliable' System Will Fail

InfoQ via YouTube

Overview

Explore a 50-minute conference talk that challenges conventional approaches to system reliability and Site Reliability Engineering (SRE). Discover why traditional Root Cause Analysis (RCA) methods are fundamentally flawed, and learn seven dimensions of reliability that extend well beyond simple availability metrics, including latency, throughput, and fidelity. Examine the myths that perpetuate system instability, team burnout, and customer frustration, and understand the distinction between fault tolerance and true resilience in system design.

Delve into the SRE mindset, built on curiosity, collaboration, and treating failure as a valuable signal rather than something to eliminate. Learn why single-cause thinking fails in complex systems and why popular techniques like the "5 Whys" can be counterproductive. Work through four common post-incident review traps, including human error attribution and counterfactual reasoning, that prevent meaningful learning from failures.

Understand the five stages of SRE maturity, from reactive firefighting to proactive partnership with development teams. Learn strategies for selling reliability initiatives internally while avoiding common pitfalls and unrealistic claims. Explore the concept of "conservation of toil" and evaluate whether your organization is simply trading one form of operational burden for increased system complexity. Finally, see why resilience should be viewed as an active, ongoing process rather than a static architectural property, and how that view reshapes system design and operational practices.

Syllabus

0:00 The Rilke Question: Why you must live the questions now.
1:15 Reliability is NOT Just Availability: The 7 dimensions of reliability (latency, throughput, fidelity, etc.).
3:05 Quiz: Is Your Outage an Existential Crisis? The Customer Perspective.
5:50 The SRE Mindset: Curiosity, Collaboration, and Failure as Signal.
8:10 The Myth of Root Cause: Why single-cause thinking fails in complex systems.
10:35 Stop Doing the 5 Whys: Why this common technique is terrible.
12:15 4 Post-Incident Review Traps: Human error, counterfactual reasoning, etc.
16:00 The 5 Stages of SRE: From Firefighting to Partnering.
18:40 How to Sell Reliability Internally (and what not to claim).
21:00 The Conservation of Toil: Are you trading toil for complexity?
23:00 Resilience is a Verb: Why your "resilient" architecture is just fault tolerant.

Taught by

InfoQ
