Optimizing Data Locality and GPU Utilization for Training Workloads in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to optimize data locality and GPU utilization for machine learning training workloads in Kubernetes environments through this 32-minute conference talk. Explore the significant data processing and storage challenges organizations face when scaling model training workloads in cloud-native environments, including managing massive training datasets across distributed storage systems while maintaining optimal I/O performance. Discover how Kubernetes excels at compute orchestration but struggles with data distribution across multiple storage backends, creating bottlenecks that impact training performance and infrastructure costs. Examine a Kubernetes-native distributed caching system that leverages NVMe storage to overcome data locality challenges and improve overall system performance. Gain insights from real-world, large-scale production use cases demonstrating how this architecture reduces data infrastructure costs, increases GPU utilization rates, and enables workload portability to address GPU scarcity challenges in modern cloud-native machine learning deployments.
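The core idea behind the caching architecture described above is read-through caching: the first access to a training shard pays the cost of fetching from remote storage, while subsequent accesses are served from fast local media such as NVMe. The sketch below is a minimal, illustrative Python model of that pattern, not Alluxio's actual implementation; the class and function names are invented for demonstration.

```python
import os
import tempfile
import time


class ReadThroughCache:
    """Toy read-through cache: a miss fetches from a slow backing
    store and writes a local copy (standing in for NVMe); later
    reads of the same key are served from the local copy."""

    def __init__(self, fetch_fn, cache_dir=None):
        self.fetch_fn = fetch_fn  # simulates a remote storage read
        self.cache_dir = cache_dir or tempfile.mkdtemp()
        os.makedirs(self.cache_dir, exist_ok=True)

    def read(self, key):
        path = os.path.join(self.cache_dir, key)
        if os.path.exists(path):          # cache hit: local disk read
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch_fn(key)         # cache miss: remote fetch
        with open(path, "wb") as f:       # populate the local cache
            f.write(data)
        return data


def slow_remote_read(key):
    """Stand-in for a remote object store read with network latency."""
    time.sleep(0.05)
    return f"shard:{key}".encode()


cache = ReadThroughCache(slow_remote_read)
first = cache.read("batch-0")   # miss: pays the remote latency once
second = cache.read("batch-0")  # hit: served from the local copy
assert first == second == b"shard:batch-0"
```

In a Kubernetes setting, the local copy would typically live on node-attached NVMe exposed to training pods, so repeated epochs over the same dataset avoid refetching from remote storage and GPUs spend less time stalled on I/O.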
Syllabus
Optimizing Data Locality and GPU Utilization for Training Workloads in Kubernetes - Bin Fan, Alluxio
Taught by
CNCF [Cloud Native Computing Foundation]