In this course, you will learn the core components of Site Reliability Engineering (SRE). The course begins with an introduction to Zero Trust security, then covers Service Level Objectives (SLOs) and Service Level Indicators (SLIs), capacity management, on-call effectiveness, and incident management.
Overview
Syllabus
- Zero Trust Security Concepts
- This lesson reviews the core components required to implement a zero trust security system and explains how policy-based management systems allow us to "Never Trust, Always Verify".
- An Introduction to SLOs and SLIs
- In this lesson, we will learn how SREs monitor systems using SLOs and SLIs. We will create queries in Prometheus and dashboards in Grafana.
- Capacity Management: Managing System Capacity
- System capacity is an essential part of ensuring reliability. This lesson discusses how to balance system capacity against cost so that neither resources nor money are wasted.
- On-call Effectiveness and Incident Management Best Practices
- A solid on-call rotation is essential to achieving peak reliability. This lesson discusses how to combine balanced on-call shifts with a clear incident management process that your team can follow.
Taught by
Richard Phung, Travis Scotto and Sonny Sevin