Overview
This AWS re:Invent 2024 conference session covers high-performance distributed model training with Amazon SageMaker. It discusses advanced parallelization techniques, communication optimizations, and efficient checkpointing strategies for distributing training workloads across hundreds or thousands of GPUs. The session explains how to train foundation models whose billions or trillions of parameters exceed the capacity of a single GPU, while reducing model training time and costs by up to 20%. It also examines the infrastructure requirements for scaling distributed training and shows how to integrate SageMaker training capabilities to optimize the total cost of foundation model development.
Syllabus
AWS re:Invent 2024 - High performance distributed model training with Amazon SageMaker (AIM380)
Taught by
AWS Events