NVIDIA Run:ai and Amazon SageMaker HyperPod Integration for Distributed Training
AWS Events via YouTube
Overview
Explore how NVIDIA Run:ai integrates with Amazon SageMaker HyperPod to simplify and scale large AI training workloads in this 51-minute conference presentation from AWS re:Invent 2025. Discover how SageMaker HyperPod provides robust clusters for resilient, distributed training while Run:ai adds centralized GPU management, job scheduling, quota enforcement, and dynamic hybrid-cloud bursting capabilities. Learn how this integration enables organizations to seamlessly run, shift, and resume workloads across on-premises and cloud resources, improving GPU utilization and resilience. Examine real-world scenarios including multi-cluster training, elastic PyTorch jobs, inference operations, and Jupyter development environments that demonstrate streamlined, efficient, and flexible AI infrastructure management. Gain insights into how this combined solution addresses the challenges of managing distributed AI training at scale while optimizing resource utilization across hybrid cloud environments.
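The elastic PyTorch jobs mentioned above rely on PyTorch's fault-tolerant launcher, which lets a training job shrink, grow, or restart as nodes come and go — the property that makes hybrid-cloud bursting and resume-on-failure possible. A minimal sketch of such a launch with `torchrun` (the flags are standard `torchrun` options; the node count, endpoint host, and `train.py` script are illustrative assumptions, not details from the presentation):

```shell
# Elastic launch: the job tolerates membership changes between 1 and 4 nodes
# and restarts up to 3 times on worker failure before giving up.
# "head-node" and train.py are hypothetical placeholders for this sketch.
torchrun \
  --nnodes=1:4 \
  --nproc_per_node=8 \
  --max_restarts=3 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=head-node:29400 \
  train.py
```

Because the rendezvous backend re-forms the worker group on each membership change, a scheduler such as Run:ai can preempt or migrate workers (e.g., to reclaim quota or burst to cloud capacity) without killing the whole job, provided the training script checkpoints and resumes its state.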
Syllabus
AWS re:Invent 2025 - NVIDIA Run:ai & Amazon SageMaker HyperPod Integration for Distributed Training
Taught by
AWS Events