Learn about site reliability engineering (SRE), an engineering discipline that helps you sustainably achieve the appropriate level of reliability in your systems, services, and products.
In this module, you will:
- Gain a basic understanding of Site Reliability Engineering (SRE).
- Learn how to get started with this valuable operations practice.
Learn how to manage site reliability.
After completing this module, you'll be able to:
- Describe how site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production.
- Describe how Application Insights analyzes the performance of your web application and can warn you about potential problems.
- List the processes that you can implement to monitor site reliability.
- Build a "just culture" that balances safety and accountability.
Cloud Admin course from Dr. Majd Sakr at Carnegie Mellon University. Discover what cloud elasticity means and different ways to scale your cloud resources.
In this module, you will:
- Describe common load patterns and how they drive the need to scale
- Enumerate the strategies and considerations in scaling cloud applications
- Discuss the advantages of auto-scaling and the mechanisms used to achieve it
- Describe the importance of load balancing in cloud applications and enumerate various methods to achieve it
- List the primary benefits of serverless computing and explain the concept of serverless functions
This content is provided in partnership with Dr. Majd Sakr and Carnegie Mellon University.
Carnegie Mellon University's Cloud Developer course. Learn how developers write programs that run on the cloud, including how to deploy applications, build in fault tolerance, balance load, scale, and deal with latency.
In this module, you will:
- Evaluate different considerations when programming applications that run on clouds
- Evaluate different considerations when deploying applications on clouds
- Compare and contrast proactive and reactive measures for fault tolerance in cloud applications
- Describe the importance of load balancing in cloud applications and enumerate various methods to achieve it
- Enumerate the strategies and considerations in scaling cloud applications
- Make the case for minimizing tail latency and discuss strategies to reduce it
- Describe strategies to optimize the total operational cost of using cloud services
In partnership with Dr. Majd Sakr and Carnegie Mellon University.
Learn how to monitor your Azure VMs by using Azure Monitor to collect and analyze VM host and client metrics and logs.
- Understand which monitoring data you need to collect from your VM.
- Enable and view recommended alerts and diagnostics.
- Use Azure Monitor to collect and analyze VM host metrics data.
- Use Azure Monitor Agent to collect VM client performance metrics and event logs.

Syllabus

Introduction to Site Reliability Engineering (SRE)
- Introduction to site reliability engineering
- What is SRE and why does it matter?
- SRE in context
- Key SRE principles and practices: virtuous cycles
- Key SRE principles and practices: the human side of SRE
- Getting started with SRE
- Summary
Manage site reliability
- Introduction
- What is reliability engineering?
- What is Application Insights?
- Perform ongoing tuning to reduce meaningless alerts
- Analyze alerts to establish a baseline
- Blameless postmortems
- Module assessment
- Summary
Scale your cloud resources with elasticity
- Introduction
- Compute load patterns
- Scaling compute resources
- Automated scaling on the cloud
- Load balancing
- Serverless computing
- Summary
Build applications on the cloud
- Introduction
- Programming the cloud
- Deploy applications on the cloud
- Build fault-tolerant cloud services
- Load balancing
- Scale resources
- How to deal with tail latency
- Economics for cloud applications
- Summary
Monitor your Azure virtual machines with Azure Monitor
- Introduction
- Monitoring for Azure VMs
- Monitor VM host data
- Use Metrics Explorer to view detailed host metrics
- Collect client performance counters by using VM insights
- Collect VM client event logs
- Summary