Overview
This Conf42 SRE 2025 conference talk explores how to achieve zero downtime in machine learning deployments. It examines the gap between ML engineers and SREs, explains why traditional observability approaches fall short for ML systems, and presents strategies for detecting silent failures and data drift. It also covers techniques for implementing ML monitoring systems, the types of data drift and their operational impact, and best practices for ML observability, along with tools and techniques for maintaining continuous service while deploying ML models. The talk concludes with actionable insights for keeping ML systems reliable in production.
Syllabus
00:00 Introduction to Zero Downtime ML Observability
01:07 Understanding the Gap Between ML Engineers and SREs
02:43 Challenges in Traditional Observability for ML Systems
04:45 Addressing Silent Failures and Data Drifts
08:05 Implementing Effective ML Monitoring Systems
10:37 Types of Data Drifts and Their Impact
16:50 Best Practices for ML Observability
18:52 Tools and Techniques for Zero Downtime
20:31 Conclusion and Final Thoughts
Taught by
Conf42