Linux Foundation

Unlocking Scalable Distributed Training With Arrow Data Cache on Kubernetes

Linux Foundation via YouTube

Overview

Explore how to overcome data I/O bottlenecks in GPU-accelerated AI training workloads through an innovative Arrow-based data cache solution designed for Kubernetes environments. Learn about the challenges of efficiently feeding data into large-scale AI model training as datasets and model complexity continue to grow, particularly in cloud-native environments where elasticity and performance are critical.

Discover how this open-source solution leverages Apache Arrow's columnar format and zero-copy semantics to decouple data preprocessing from training jobs, enabling sharing of preprocessed datasets across distributed training nodes while reducing data loading overhead and improving GPU utilization. Understand the integration capabilities with Kubernetes-native orchestration tools including Kubeflow TrainJob, JobSet, LeaderWorkerSet, Volcano, and Kueue, and how this design pattern enables reproducibility, cache reuse, and enhanced performance across multi-tenant environments.

Gain practical insights for building scalable, cloud-native training workloads that work effectively with PyTorch, TensorFlow, and JAX frameworks, specifically focusing on tabular datasets stored as Apache Iceberg tables and strategies to prevent data bottlenecks from limiting your distributed training performance.

Syllabus

Unlocking Scalable Distributed Training With Arrow Data Cache on Kubernetes - Ricardo Aravena, CNCF

Taught by

Linux Foundation

