Overview
Learn how to build reliable and scalable generative AI platforms in this conference talk from SREcon25 EMEA. The talk covers practical strategies for managing the infrastructure challenges of GenAI systems, including heavy resource demands, token-based scaling patterns, and comprehensive monitoring requirements. It describes how Bloomberg engineers tackled the reliability challenges of multi-cluster AI platforms using actionable Service Level Objectives (SLOs) as their guiding framework, and how open-source-friendly approaches can make AI systems observable, debuggable, and resilient across training workloads, inference services, and the underlying infrastructure. The techniques apply whether you are building your first AI platform or optimizing existing cluster deployments.
Syllabus
SREcon25 Europe/Middle East/Africa - Dashboards & Dragons: Reliability Magic for AI Platforms
Taught by
USENIX