Orchestration Needs for AI Clusters at Scale - Lessons Learned from Two Leading Providers
Open Compute Project via YouTube
Overview
This 15-minute technical talk from Supermicro and Broadcom experts covers the orchestration and operations requirements of large-scale AI clusters. It draws on real-world examples and solutions that use SONiC to manage thousands of switches and tens of thousands of links. Key considerations include the choice of accelerator vendor, InfiniBand versus Ethernet fabrics, templated scale-unit designs, and switch and adapter orchestration. The talk walks through translating high-level requirements into practical designs, automating Day 0 and Day 1 deployments, validating implementations, and setting up Day 2 monitoring. It also addresses preventing configuration drift, using telemetry for performance optimization, and managing multi-tenant environments.
Syllabus
Orchestration needs for AI clusters at scale – Lessons learned from two leading providers
Taught by
Open Compute Project