Building Resilient Systems

Starweaver via Coursera

Overview

Building resilient systems requires more than knowing individual tools: it demands the ability to design architectures that anticipate failure and recover effectively. In this intermediate course, you will learn how to apply resilience engineering principles to modern distributed systems, focusing on high availability, fault tolerance, and disaster recovery planning. You will analyze how and why systems fail, identify hidden risks in system architecture, and design strategies that improve uptime and reliability. The course connects key concepts such as load balancing, redundancy, observability, and incident response into a cohesive resilience strategy aligned with business-driven recovery objectives such as recovery time objective (RTO) and recovery point objective (RPO). Designed for IT professionals, DevOps engineers, and system architects, this course emphasizes practical decision-making, trade-offs, and operational readiness. By the end, you will be able to design resilient architectures, strengthen system reliability, and lead effective incident management and continuous improvement practices.

Syllabus

  • Foundations of Resilient Systems
    • This module introduces the core concepts behind resilient system design. Learners will explore why failures are inevitable, how resilient systems differ from traditional architectures, and the foundational principles used to build systems that can withstand, adapt to, and recover from disruptions. The module sets the mindset and technical baseline required for designing reliable and fault-aware systems.
  • High Availability and Fault Tolerance Design
    • This module focuses on designing systems that remain available despite failures. Learners will explore high availability concepts, fault tolerance techniques, and architectural patterns used to eliminate single points of failure. The module emphasizes practical design decisions that improve uptime while balancing cost and complexity.
  • Disaster Recovery Planning and Operational Readiness
    • This module focuses on preparing systems and teams to recover from major disruptions. Learners will explore backup and recovery strategies, define recovery objectives, design disaster recovery testing approaches, and create operational runbooks that support consistent and effective recovery. The module emphasizes planning, decision-making, and operational readiness rather than tool-specific implementation.
  • Monitoring, Observability, and Incident Management
    • This module focuses on maintaining system reliability through effective monitoring, observability, and structured incident management. Learners will explore how logs, metrics, and traces provide system visibility, how alerting strategies support timely response, and how post-incident reviews drive continuous improvement. The module emphasizes operational effectiveness and learning from incidents rather than tool-specific implementation.

Taught by

Ahmed Elhenedy and Starweaver
