Overview
Learn systematic approaches to incident response and problem-solving in this 44-minute conference talk, which shows how to turn chaotic emergencies into structured, effective responses. Explore the decision-making loops and epistemological frameworks needed to reason about unknown system failures under pressure. Learn to construct worthwhile hypotheses and design tests that isolate where a problem is located and what caused it. Discover how to structure incident notes and progress updates so that they maximize signal and minimize noise during a fast-moving crisis. Develop interpersonal communication techniques that keep information flowing, encourage team participation, and create the psychological safety needed for collaboration during high-stress incidents. The methods are drawn from nearly a decade of Site Reliability Engineering experience at organizations ranging from small startups to publicly traded SaaS companies. The talk also offers a foundational curriculum for teaching these incident response concepts to others, combining insights from infrastructure engineering, incident command systems, and emergency medical care.
Syllabus
Epistemology of Incidents & Problem Solving
Taught by
NANOG