Overview
This 21-minute conference talk from the Linux Foundation's Open Source Summit examines networking problems that degrade the performance and reliability of AI/ML jobs. Like high-performance race cars, AI/ML workloads depend on optimized network fabrics such as RoCE (RDMA over Converged Ethernet) and InfiniBand to run at full speed. Key issues covered include NIC flapping, which reduces reliability, and limited visibility at the queue pair level, which hampers troubleshooting. Like debris on a race track, these problems cause slowdowns, disruptions, and costly rollbacks to earlier checkpoints that directly affect ROI. Practical demonstrations show the real-world effect of networking problems on AI/ML job completion times and overall system performance. The talk also introduces the network fabric optimization and monitoring techniques AI/ML engineers need to keep workloads running at full speed, much as pit crews keep race cars performing at their best.
Syllabus
AI/ML Networking Challenges: The Fast and the Finicky! - Lerna Ekmekcioglu, Clockwork Systems
Taught by
Linux Foundation