Optimizing Data Locality and GPU Utilization for Training Workloads in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to optimize data locality and GPU utilization for machine learning training workloads on Kubernetes in this 32-minute conference talk. Explore the data processing and storage challenges organizations face when scaling model training in cloud-native environments, including managing massive training datasets across distributed storage systems while sustaining I/O performance. Discover why Kubernetes, which excels at compute orchestration, struggles with data distribution across multiple storage backends, and how the resulting bottlenecks affect training performance and infrastructure costs. Examine a Kubernetes-native distributed caching system that uses NVMe storage to overcome data locality problems and improve overall system performance. Gain insights from real-world, large-scale production use cases showing how this architecture reduces data infrastructure costs, increases GPU utilization, and enables workload portability to cope with GPU scarcity.
Syllabus
Optimizing Data Locality and GPU Utilization for Training Workloads in Kubernetes - Bin Fan, Alluxio
Taught by
CNCF [Cloud Native Computing Foundation]