Dashboards and Dragons - Reliability Magic for AI Platforms

USENIX via YouTube

Overview

Learn how to build reliable and scalable generative AI platforms through this conference talk from SREcon25 EMEA. Discover practical strategies for managing the complex infrastructure challenges that come with GenAI systems, including heavy resource demands, token-based scaling patterns, and comprehensive monitoring requirements. Explore how Bloomberg engineers tackled the reliability challenges of multi-cluster AI platforms using actionable Service Level Objectives (SLOs) as their guiding framework. Gain insights into making AI systems observable, debuggable, and resilient through open-source-friendly approaches that address the unique complexities of training workloads, inference services, and the underlying infrastructure. Master practical techniques for scaling generative AI platforms while maintaining reliability standards, whether you're building your first AI platform or optimizing existing cluster deployments.

Syllabus

SREcon25 Europe/Middle East/Africa - Dashboards & Dragons: Reliability Magic for AI Platforms

Taught by

USENIX
