Overview
Explore high-performance distributed model training in this AWS re:Invent 2024 conference session on Amazon SageMaker. The session covers advanced parallelization techniques, communication optimizations, and efficient checkpointing strategies for distributing training workloads across hundreds or thousands of GPUs. Learn how to train foundation models whose billions or trillions of parameters exceed the capacity of a single GPU, while reducing training time and costs by up to 20%. The session also examines the infrastructure requirements for scaling distributed training and shows how to integrate SageMaker training capabilities to optimize the total cost of foundation model development.
Syllabus
AWS re:Invent 2024 - High performance distributed model training with Amazon SageMaker (AIM380)
Taught by
AWS Events