Overview
Explore how Site Reliability Engineering (SRE) practices are evolving to meet the demands of AI infrastructure in this 36-minute conference talk from SREcon25 Americas. Qian Ding from Ant Group examines the challenges that arise as AI models grow in complexity and scale and come to require specialized approaches to infrastructure management. The talk covers the specific difficulties of managing GPU-accelerated clusters, including anomaly detection, node lifecycle management, and the distinctive requirements of AI workloads. It draws on real-world experience and practical lessons to help SREs keep AI infrastructure reliable, scalable, and performant.
Syllabus
SREcon25 Americas - Transformers in SRE Land: Evolving to Manage AI Infrastructure
Taught by
USENIX