Optimizing Data Locality and GPU Utilization for Training Workloads in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to optimize data locality and GPU utilization for machine learning training workloads in Kubernetes environments through this 32-minute conference talk. Explore the significant data processing and storage challenges organizations face when scaling model training workloads in cloud-native environments, including managing massive training datasets across distributed storage systems while maintaining optimal I/O performance. Discover how Kubernetes excels at compute orchestration but struggles with data distribution across multiple storage backends, creating bottlenecks that impact training performance and infrastructure costs. Examine a Kubernetes-native distributed caching system that leverages NVMe storage to overcome data locality challenges and improve overall system performance. Gain insights from real-world, large-scale production use cases demonstrating how this architecture reduces data infrastructure costs, increases GPU utilization rates, and enables workload portability to address GPU scarcity challenges in modern cloud-native machine learning deployments.
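The talk's core idea, serving training data from fast local storage (such as NVMe) and falling back to the remote backend only on a miss, can be sketched as a minimal read-through cache. This is an illustrative sketch only: the class, file layout, and `fetch_fn` callback are hypothetical and do not reflect Alluxio's actual API.

```python
import hashlib
import os
import tempfile

class ReadThroughCache:
    """Illustrative read-through cache: serve files from fast local
    storage (e.g. an NVMe volume mounted on the node) and fetch from
    the remote store only on a miss. Not Alluxio's real API."""

    def __init__(self, cache_dir, fetch_fn):
        self.cache_dir = cache_dir
        self.fetch_fn = fetch_fn  # pulls bytes from the remote backend (hypothetical)
        self.hits = 0
        self.misses = 0
        os.makedirs(cache_dir, exist_ok=True)

    def _local_path(self, key):
        # Hash the key so any remote path maps to a flat local file name.
        return os.path.join(self.cache_dir,
                            hashlib.sha256(key.encode()).hexdigest())

    def read(self, key):
        path = self._local_path(key)
        if os.path.exists(path):
            self.hits += 1          # fast local read, GPU stays busy
        else:
            self.misses += 1
            data = self.fetch_fn(key)  # slow remote read (S3, HDFS, ...)
            with open(path, "wb") as f:
                f.write(data)
        with open(path, "rb") as f:
            return f.read()

# Usage: two epochs over the same tiny "dataset"; the second epoch
# hits the local cache every time, which is where the GPU-utilization
# win comes from in repeated-epoch training.
remote = {"s3://bucket/shard-0": b"batch-0", "s3://bucket/shard-1": b"batch-1"}
cache = ReadThroughCache(tempfile.mkdtemp(), lambda k: remote[k])
for _ in range(2):
    for key in remote:
        cache.read(key)
print(cache.hits, cache.misses)  # → 2 2
```

In a Kubernetes deployment the cache directory would typically sit on a node-local NVMe volume (for example a local PersistentVolume or hostPath mount), so repeated epochs read at local disk speed instead of re-crossing the network to object storage.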
Syllabus
Optimizing Data Locality and GPU Utilization for Training Workloads in Kubernetes - Bin Fan, Alluxio
Taught by
CNCF [Cloud Native Computing Foundation]