Benchmarking Your Distributed ML Training on the K8s Platform
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This lightning talk explores how to benchmark distributed machine learning training on Kubernetes. Discover the challenges of running ML training workloads on Kubernetes, including dynamic resource scaling, GPU scheduling, and efficient inter-node communication. Learn how recent advancements such as KubeRay, Kubeflow, and Slurm integration have expanded Kubernetes' capabilities for handling complex, large-scale ML training tasks. Explore the design and implementation of a benchmarking platform that provides actionable insights for improving the throughput, scalability, and efficiency of distributed ML training workloads on Kubernetes.
Syllabus
Lightning Talk: Benchmarking Your Distributed ML Training on the K8s Platform - Liang Yan, CoreWeave
Taught by
CNCF [Cloud Native Computing Foundation]