Overview
Learn how to build reliable, scalable generative AI platforms in this conference talk from SREcon25 EMEA. Discover practical strategies for managing the infrastructure challenges specific to GenAI systems, including heavy resource demands, token-based scaling patterns, and comprehensive monitoring requirements. Explore how Bloomberg engineers tackled the reliability challenges of multi-cluster AI platforms using actionable Service Level Objectives (SLOs) as their guiding framework. Gain insight into making AI systems observable, debuggable, and resilient through open-source-friendly approaches that address the distinct complexities of training workloads, inference services, and the underlying infrastructure. The techniques apply whether you're building your first AI platform or optimizing existing cluster deployments.
Syllabus
SREcon25 Europe/Middle East/Africa - Dashboards & Dragons: Reliability Magic for AI Platforms
Taught by
USENIX