Optimizing Data Locality and GPU Utilization for Training Workloads in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to optimize data locality and GPU utilization for machine learning training workloads in Kubernetes environments through this 32-minute conference talk. Explore the significant data processing and storage challenges organizations face when scaling model training workloads in cloud-native environments, including managing massive training datasets across distributed storage systems while maintaining optimal I/O performance. Discover how Kubernetes excels at compute orchestration but struggles with data distribution across multiple storage backends, creating bottlenecks that impact training performance and infrastructure costs. Examine a Kubernetes-native distributed caching system that leverages NVMe storage to overcome data locality challenges and improve overall system performance. Gain insights from real-world, large-scale production use cases demonstrating how this architecture reduces data infrastructure costs, increases GPU utilization rates, and enables workload portability to address GPU scarcity challenges in modern cloud-native machine learning deployments.
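The talk's core idea, serving training data from fast local storage (such as NVMe) and falling back to the remote backend only on a miss, can be sketched as a minimal read-through cache. This is an illustrative sketch only: the class, file layout, and `fetch_fn` callback are hypothetical and do not reflect Alluxio's actual API.

```python
import hashlib
import os
import tempfile

class ReadThroughCache:
    """Illustrative read-through cache: serve files from fast local
    storage (e.g. an NVMe volume mounted on the node) and fetch from
    the remote store only on a miss. Not Alluxio's real API."""

    def __init__(self, cache_dir, fetch_fn):
        self.cache_dir = cache_dir
        self.fetch_fn = fetch_fn  # pulls bytes from the remote backend (hypothetical)
        self.hits = 0
        self.misses = 0
        os.makedirs(cache_dir, exist_ok=True)

    def _local_path(self, key):
        # Hash the key so any remote path maps to a flat local file name.
        return os.path.join(self.cache_dir,
                            hashlib.sha256(key.encode()).hexdigest())

    def read(self, key):
        path = self._local_path(key)
        if os.path.exists(path):
            self.hits += 1          # fast local read, GPU stays busy
        else:
            self.misses += 1
            data = self.fetch_fn(key)  # slow remote read (S3, HDFS, ...)
            with open(path, "wb") as f:
                f.write(data)
        with open(path, "rb") as f:
            return f.read()

# Usage: two epochs over the same tiny "dataset"; the second epoch
# hits the local cache every time, which is where the GPU-utilization
# win comes from in repeated-epoch training.
remote = {"s3://bucket/shard-0": b"batch-0", "s3://bucket/shard-1": b"batch-1"}
cache = ReadThroughCache(tempfile.mkdtemp(), lambda k: remote[k])
for _ in range(2):
    for key in remote:
        cache.read(key)
print(cache.hits, cache.misses)  # → 2 2
```

In a Kubernetes deployment the cache directory would typically sit on a node-local NVMe volume (for example a local PersistentVolume or hostPath mount), so repeated epochs read at local disk speed instead of re-crossing the network to object storage.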
Syllabus
Optimizing Data Locality and GPU Utilization for Training Workloads in Kubernetes - Bin Fan, Alluxio
Taught by
CNCF [Cloud Native Computing Foundation]