Overview
Explore how to achieve instant GPU scaling for serverless AI inference in this conference talk from AWS re:Invent 2025. Discover Modal's approach to building a Rust-based container stack that can spin thousands of GPUs up and down within seconds, addressing the challenge of cost-efficient GPU compute for generative AI model deployment. Learn about the technical architecture and strategies that enable flexible GPU access while maximizing cost efficiency under volatile production inference demand. Understand how Modal leverages AWS compute and storage services to deliver instant, scalable GPU resources to ML engineers at organizations of all sizes, from startups to large enterprises. Gain insight into resolving the fundamental tension between expensive GPU resources and unpredictable inference demand patterns in modern AI applications.
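
To make the programming model concrete, the sketch below shows what fanning inference requests out across many GPU-backed containers might look like with Modal's public Python SDK. This is a minimal illustration, not code from the session: the app name, image contents, predict function, GPU type, and batch of prompts are all hypothetical assumptions, and a real deployment would load an actual model inside the container.

    import modal

    # Hypothetical app and image; names are illustrative, not from the talk.
    app = modal.App("inference-sketch")
    image = modal.Image.debian_slim().pip_install("torch", "transformers")

    @app.function(gpu="A100", image=image)
    def predict(prompt: str) -> str:
        # Placeholder for real model loading and generation logic.
        return f"completion for: {prompt}"

    @app.local_entrypoint()
    def main():
        prompts = [f"request {i}" for i in range(1000)]
        # .map() fans the calls out across containers; Modal's scheduler
        # provisions GPU-backed containers on demand and scales them back
        # down as the queue drains, so idle GPUs are not billed.
        for result in predict.map(prompts):
            print(result)

Executed with the modal CLI (modal run), each mapped call lands in its own container, which is the serverless property the talk examines: capacity follows demand rather than being pre-provisioned.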
Syllabus
AWS re:Invent 2025 - Scaling instantly to 1000 GPUs for Serverless AI inference (AIM2201)
Taught by
AWS Events