NVIDIA Run:ai and Amazon SageMaker HyperPod Integration for Distributed Training
AWS Events via YouTube
Overview
Explore how NVIDIA Run:ai integrates with Amazon SageMaker HyperPod to simplify and scale large AI training workloads in this 51-minute conference presentation from AWS re:Invent 2025. SageMaker HyperPod provides robust clusters for resilient distributed training, while Run:ai adds centralized GPU management, job scheduling, quota enforcement, and dynamic hybrid-cloud bursting. Learn how this integration lets organizations run, shift, and resume workloads across on-premises and cloud resources, improving GPU utilization and resilience. Examine real-world scenarios, including multi-cluster training, elastic PyTorch jobs, inference operations, and Jupyter development environments, that demonstrate streamlined, efficient, and flexible AI infrastructure management. Gain insight into how the combined solution addresses the challenges of managing distributed AI training at scale while optimizing resource utilization across hybrid cloud environments.
Syllabus
AWS re:Invent 2025 - NVIDIA Run:ai & Amazon SageMaker HyperPod Integration for Distributed Training
Taught by
AWS Events