An SRE Approach to Monitoring ML in Production

USENIX via YouTube

Overview

This 39-minute conference talk from SREcon25 Americas explores how Site Reliability Engineering (SRE) principles apply to monitoring machine learning (ML) systems in production. Daria Barteneva of Microsoft Azure examines the challenges of operationalizing ML within large distributed systems and the expertise gap between ML development and production reliability. The talk covers how to decompose complex ML systems into observable components, why traditional observability practices fall short for ML workloads, and which mechanisms help monitor end-to-end system reliability and quality. It offers practical insights for SREs who are, or soon will be, responsible for serving ML models at scale in production.

Syllabus

SREcon25 Americas - An SRE Approach to Monitoring ML in Production

Taught by

USENIX
