
YouTube

NVIDIA Run:ai and Amazon SageMaker HyperPod Integration for Distributed Training

AWS Events via YouTube

Overview

Explore how NVIDIA Run:ai integrates with Amazon SageMaker HyperPod to simplify and scale large AI training workloads in this 51-minute conference presentation from AWS re:Invent 2025. Discover how SageMaker HyperPod provides resilient clusters for distributed training, while Run:ai adds centralized GPU management, job scheduling, quota enforcement, and dynamic hybrid-cloud bursting. Learn how the integration lets organizations run, shift, and resume workloads across on-premises and cloud resources, improving both GPU utilization and resilience. Examine real-world scenarios, including multi-cluster training, elastic PyTorch jobs, inference operations, and Jupyter development environments, that demonstrate streamlined and flexible AI infrastructure management. Gain insight into how this combined solution addresses the challenges of managing distributed AI training at scale while optimizing resource use across hybrid cloud environments.

Syllabus

AWS re:Invent 2025 - NVIDIA Run:ai & Amazon SageMaker HyperPod Integration for Distributed Training

Taught by

AWS Events

Reviews

