Overview
Explore how Site Reliability Engineering practices are evolving to meet the demands of AI infrastructure in this 36-minute conference talk from SREcon25 Americas. Qian Ding from Ant Group examines the unique challenges that arise as AI models grow in complexity and scale, requiring specialized approaches to infrastructure management. Learn about the specific difficulties in managing GPU-accelerated clusters, including effective anomaly detection, node lifecycle management, and addressing the distinctive requirements of AI workloads. Gain valuable insights from real-world experiences and practical lessons that can help SREs navigate this new technological frontier while ensuring reliability, scalability, and optimal performance of AI infrastructure systems.
Syllabus
SREcon25 Americas - Transformers in SRE Land: Evolving to Manage AI Infrastructure
Taught by
USENIX