Overview
Explore critical networking challenges that impact AI/ML job performance and reliability in this 21-minute conference talk from the Linux Foundation's Open Source Summit. Discover how AI/ML workloads, like high-performance race cars, require optimized network fabrics such as RoCE and InfiniBand to achieve peak efficiency and speed. Learn about key networking issues, including NIC flapping, which reduces reliability, and limited visibility at the queue pair level, which hampers troubleshooting. Understand how these challenges act like debris on a race track, causing slowdowns, disruptions, and costly rollbacks to previous checkpoints that directly impact ROI. Watch practical demonstrations showing the real-world effects of networking problems on AI/ML job completion times and overall system performance. Gain essential knowledge of the network fabric optimization and monitoring techniques that AI/ML engineers need to keep their workloads running at full speed, much as pit crews keep race cars performing at their best.
Syllabus
AI/ML Networking Challenges: The Fast and the Finicky! - Lerna Ekmekcioglu, Clockwork Systems
Taught by
Linux Foundation