Practice of Building AI Training Clusters Based on Kubernetes and RoCEv2
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore the practice of building AI training clusters using Kubernetes and RoCEv2 in this 42-minute conference talk. Learn how to integrate RoCEv2 lossless networks into Kubernetes, use RoCEv2 networks inside Kubernetes pods, optimize resource scheduling for nodes with multiple GPUs and RoCE network cards, and make the necessary adjustments to AI training tasks. Discover solutions to challenges such as network card virtualization, RoCE lossless network configuration, and running training tasks on RoCEv2 and Kubernetes. Gain insight into the advantages of RoCEv2 networks over traditional InfiniBand networks for AI cluster construction, and into strategies for implementing them.
Syllabus
Practice of Building AI Training Cluster Based on Kubernetes+RoCEv2 - Wang DeKui & Wang Chao, IEI
Taught by
CNCF [Cloud Native Computing Foundation]