Model Training in Public Clouds - Case for IBM Storage Scale

SNIAVideo via YouTube

Overview

This conference presentation from the SNIA Storage Developer Conference 2025 covers IBM's approach to building AI-HPC clusters for large language model training. IBM's Cloud Vela cluster departs from conventional HPC design by using public cloud infrastructure, virtual machines, Ethernet networking, and Kubernetes orchestration to run AI workloads. The talk describes the I/O requirements of distributed training jobs, including reading training data and writing large periodic checkpoints, and explains why traditional cloud storage options fall short for these workloads. It then details the architecture of IBM's tiered storage solution, which layers the GPFS distributed file system over cloud object storage to address both performance and semantic challenges. Further topics include the design of IBM's data mover and its integration with Kubernetes, which automates provisioning of file system volumes backed by object buckets, as well as the practical challenges of observability and cache sizing when deploying storage systems for AI training in cloud environments. The presentation is aimed at architects and engineers implementing cost-effective, cloud-native storage for machine learning and AI training workloads.

Syllabus

SNIA SDC 2025 - Model Training in Public Clouds: Case for IBM Storage Scale

Taught by

SNIAVideo
