NVIDIA Run:ai and Amazon SageMaker HyperPod Integration for Distributed Training
AWS Events via YouTube
Overview
Explore how NVIDIA Run:ai integrates with Amazon SageMaker HyperPod to simplify and scale large AI training workloads in this 51-minute conference presentation from AWS re:Invent 2025. Discover how SageMaker HyperPod provides robust clusters for resilient, distributed training while Run:ai adds centralized GPU management, job scheduling, quota enforcement, and dynamic hybrid-cloud bursting capabilities. Learn how this integration enables organizations to seamlessly run, shift, and resume workloads across on-premises and cloud resources, improving GPU utilization and resilience. Examine real-world scenarios including multi-cluster training, elastic PyTorch jobs, inference operations, and Jupyter development environments that demonstrate streamlined, efficient, and flexible AI infrastructure management. Gain insights into how this combined solution addresses the challenges of managing distributed AI training at scale while optimizing resource utilization across hybrid cloud environments.
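The elastic PyTorch jobs mentioned above rely on PyTorch's fault-tolerant launcher, which lets a training job shrink, grow, or restart as nodes come and go — the property that makes hybrid-cloud bursting and resume-on-failure possible. A minimal sketch of such a launch with `torchrun` (the flags are standard `torchrun` options; the node count, endpoint host, and `train.py` script are illustrative assumptions, not details from the presentation):

```shell
# Elastic launch: the job tolerates membership changes between 1 and 4 nodes
# and restarts up to 3 times on worker failure before giving up.
# "head-node" and train.py are hypothetical placeholders for this sketch.
torchrun \
  --nnodes=1:4 \
  --nproc_per_node=8 \
  --max_restarts=3 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=head-node:29400 \
  train.py
```

Because the rendezvous backend re-forms the worker group on each membership change, a scheduler such as Run:ai can preempt or migrate workers (e.g., to reclaim quota or burst to cloud capacity) without killing the whole job, provided the training script checkpoints and resumes its state.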
Syllabus
AWS re:Invent 2025 - NVIDIA Run:ai & Amazon SageMaker HyperPod Integration for Distributed Training
Taught by
AWS Events