Challenges of Making Large AI Clusters Reliable

USENIX via YouTube

Overview

Learn about the reliability challenges specific to operating large-scale AI clusters in this conference talk from SREcon25 Europe/Middle East/Africa. Discover how High Performance Computing clusters differ fundamentally from typical datacenter hardware, and why the traditional SRE approach of treating servers as cattle rather than pets doesn't apply to AI infrastructure. Explore how these differences change the strategies Site Reliability Engineers need when building systems on top of the lower infrastructure layers. John Looney and Panos Christeas of Crusoe.ai share practical experience and solutions for maintaining reliability in AI computing environments, where hardware failures and performance variations can significantly impact large-scale machine learning workloads.

Syllabus

SREcon25 Europe/Middle East/Africa - Challenges of Making Large AI Clusters Reliable

Taught by

USENIX
