Overview
Explore IBM's approach to building AI-HPC clusters for Large Language Model training in this conference presentation from the SNIA Storage Developer Conference 2025. See how IBM Cloud's Vela cluster departs from conventional HPC design by running AI workloads on public cloud infrastructure with virtual machines, Ethernet networking, and Kubernetes orchestration. Learn about the I/O requirements of distributed training jobs, namely reading training data and writing large periodic checkpoints, and why traditional cloud storage options fall short for these workloads. Examine the architecture of IBM's tiered storage solution, which layers the GPFS distributed file system over cloud object storage to address both performance and file system semantics. Learn how IBM's data mover is designed and how it integrates with Kubernetes to automatically provision file system volumes backed by object buckets. Understand the practical challenges of observability and cache sizing that arise when deploying storage systems for AI training in cloud environments. The presentation offers lessons for architects and engineers who want to build cost-effective, cloud-native storage for machine learning and AI training workloads.
Syllabus
SNIA SDC 2025 - Model Training in Public Clouds: Case for IBM Storage Scale
Taught by
SNIAVideo