NVIDIA Run:ai and Amazon SageMaker HyperPod Integration for Distributed Training
AWS Events via YouTube
Overview
Explore how NVIDIA Run:ai integrates with Amazon SageMaker HyperPod to simplify and scale large AI training workloads in this 51-minute conference presentation from AWS re:Invent 2025. Discover how SageMaker HyperPod provides resilient clusters for distributed training while Run:ai adds centralized GPU management, job scheduling, quota enforcement, and dynamic hybrid-cloud bursting. Learn how the integration lets organizations run, shift, and resume workloads across on-premises and cloud resources, improving GPU utilization and resilience. Examine real-world scenarios, including multi-cluster training, elastic PyTorch jobs, inference operations, and Jupyter development environments, that show how the combined solution manages distributed AI training at scale across hybrid cloud environments.
Syllabus
AWS re:Invent 2025 - NVIDIA Run:ai & Amazon SageMaker HyperPod Integration for Distributed Training
Taught by
AWS Events