Learn how Keegan McCallum, Luma AI's Head of ML Infrastructure, scaled the company's video generation platform from launch to 1 million users in just 4 days in this 19-minute conference talk from the AI Engineer World's Fair. Discover the technical challenges and solutions behind one of the most successful AI product launches, including how the team grew from 500 to 5,000 H100 GPUs within six hours, then added another 4,000 from its training cluster to handle unexpected demand. Explore Luma AI's mission to build general multimodal intelligence that can generate, understand, and operate in the physical world, going beyond video models alone. Examine product capabilities such as the "modify video" feature, which transforms iPhone videos using text prompts, and the public API for application integration. Understand the infrastructure re-architecture from a brittle Triton Inference Server setup to a custom-built serving stack on vanilla PyTorch that better supports multiple GPUs, nodes, and different chipsets. Dive into the scaling challenges and their solutions: back pressure management through a dispatch limitation system; fair scheduling with SLO-based prioritization to prevent work starvation across user tiers (API, enterprise, unlimited, light, free); automatic compute scaling on the training cluster to absorb demand bursts; and a model repository with immutable versions in object storage that enables reproducible rollbacks and seamless version switching. Gain insights from the technical war stories and lessons learned while launching a cutting-edge AI product at massive scale.

Syllabus

The initial launch challenges [00:00]: Luma AI was unprepared for the high traffic, quickly exhausting their initial GPU allocation and facing a large queue of requests.
Rapid scaling efforts [00:57]: They rapidly scaled their GPU capacity from 500 to 5,000 H100 GPUs within six hours, and later added another 4,000 H100 GPUs from their training cluster to keep up with demand.
Luma AI's mission [03:10]: Beyond just video models, Luma AI aims to build general multimodal intelligence that can generate, understand, and operate in the physical world.
Their product capabilities [03:22]: They demonstrate a "modify video" feature where users can upload iPhone videos and transform them with text prompts. They also highlight their public API for integrating this functionality into applications [03:52].
Infrastructure re-architecture [06:02]: They moved from a brittle, tightly coupled container setup built on Triton Inference Server to a custom-built serving stack on vanilla PyTorch, which offers better support for multiple GPUs, nodes, and different chipsets.
Challenges and solutions in scaling [07:39]:
Back pressure [07:51]: They implemented a dispatch limitation system to prevent too many CPU workers from queuing jobs in one cluster.
Fair scheduling and work starvation [08:36]: To serve different user tiers (API, enterprise, unlimited, light, free) without starving lower-priority jobs, they developed an SLO-based (Service Level Objective) system that prioritizes jobs based on the percentage of their worst-case waiting time [11:14].
Handling different models and bursts [08:43]: They built a system to automatically scale up compute on their training cluster to handle demand bursts [09:16].
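The back pressure and fair scheduling ideas above can be illustrated with a small sketch. It is not Luma AI's implementation: the per-tier wait targets in `SLO_SECONDS`, the `Scheduler` and `Job` classes, and the `max_in_flight` knob are hypothetical names and values chosen for illustration. It also assumes that "percentage of worst-case waiting time" means the share of a tier's wait target that a job has already spent in the queue.

```python
from dataclasses import dataclass, field

# Hypothetical worst-case wait targets per tier, in seconds (illustrative values only).
SLO_SECONDS = {"api": 60, "enterprise": 120, "unlimited": 600, "light": 1800, "free": 7200}


@dataclass
class Job:
    job_id: str
    tier: str
    enqueued_at: float

    def urgency(self, now: float) -> float:
        """Fraction of the tier's worst-case wait this job has already used."""
        return (now - self.enqueued_at) / SLO_SECONDS[self.tier]


@dataclass
class Scheduler:
    max_in_flight: int  # assumed dispatch limit for one cluster (back pressure)
    pending: list = field(default_factory=list)
    in_flight: set = field(default_factory=set)

    def submit(self, job: Job) -> None:
        self.pending.append(job)

    def dispatch(self, now: float) -> list:
        """Dispatch the most urgent jobs until the dispatch limit is reached."""
        dispatched = []
        while self.pending and len(self.in_flight) < self.max_in_flight:
            job = max(self.pending, key=lambda j: j.urgency(now))
            self.pending.remove(job)
            self.in_flight.add(job.job_id)
            dispatched.append(job)
        return dispatched

    def complete(self, job_id: str) -> None:
        """Free a dispatch slot once a job finishes."""
        self.in_flight.discard(job_id)


sched = Scheduler(max_in_flight=2)
sched.submit(Job("free-1", "free", enqueued_at=0.0))           # waited 3600s of 7200s
sched.submit(Job("api-1", "api", enqueued_at=3590.0))          # waited 10s of 60s
sched.submit(Job("ent-1", "enterprise", enqueued_at=3500.0))   # waited 100s of 120s

first_batch = [j.job_id for j in sched.dispatch(now=3600.0)]
print(first_batch)   # ['ent-1', 'free-1']

sched.complete("ent-1")
second_batch = [j.job_id for j in sched.dispatch(now=3600.0)]
print(second_batch)  # ['api-1']
```

In this sketch the long-waiting free-tier job is dispatched ahead of a fresh API job because it has used a larger share of its wait target, which is how a relative measure can avoid starving lower tiers. The dispatch limit keeps the third job queued until a slot frees up.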
Model management [13:24]: They use a model repository system where each model has immutable versions stored in object storage, including the full Python environment and checkpoints. This allows for reproducible rollbacks and seamless, on-the-fly version switching for workers [14:46].
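A minimal sketch of the model repository idea, under stated assumptions: an in-memory dictionary stands in for object storage, and the `ModelRepository` and `Worker` classes, the key layout, and the `activate` pointer are hypothetical names for illustration, not the design described in the talk. It shows why immutable versions make rollback a matter of moving a pointer.

```python
class ModelRepository:
    """Toy repository: published versions are write-once; only the active pointer moves."""

    def __init__(self):
        self._objects = {}  # stand-in for an object storage bucket: key -> bytes
        self._active = {}   # model name -> currently active version

    def publish(self, model, version, env_archive, checkpoint):
        prefix = f"{model}/{version}/"
        if any(key.startswith(prefix) for key in self._objects):
            raise ValueError(f"{model} {version} already exists and is immutable")
        # Store the Python environment together with the checkpoint for reproducibility.
        self._objects[prefix + "env.tar"] = env_archive
        self._objects[prefix + "checkpoint.pt"] = checkpoint

    def activate(self, model, version):
        if f"{model}/{version}/checkpoint.pt" not in self._objects:
            raise KeyError(f"unknown version {version} for {model}")
        self._active[model] = version

    def active_version(self, model):
        return self._active[model]

    def fetch(self, model, version):
        prefix = f"{model}/{version}/"
        return self._objects[prefix + "env.tar"], self._objects[prefix + "checkpoint.pt"]


class Worker:
    """Toy worker that reloads its model whenever the active version changes."""

    def __init__(self, repo, model):
        self.repo = repo
        self.model = model
        self.loaded_version = None
        self.checkpoint = None

    def sync(self):
        """Return True if a different version was loaded, False if already current."""
        target = self.repo.active_version(self.model)
        if target == self.loaded_version:
            return False
        _env, self.checkpoint = self.repo.fetch(self.model, target)
        self.loaded_version = target
        return True


repo = ModelRepository()
repo.publish("video-gen", "v1", b"env-v1", b"ckpt-v1")
repo.publish("video-gen", "v2", b"env-v2", b"ckpt-v2")

repo.activate("video-gen", "v2")
worker = Worker(repo, "video-gen")
worker.sync()
print(worker.loaded_version)  # v2

repo.activate("video-gen", "v1")  # rollback: v1 is unchanged since it was published
worker.sync()
print(worker.loaded_version)  # v1
```

Because a published version can never be overwritten, rolling back restores exactly the environment and checkpoint that were served before, and workers can switch versions by polling the pointer and reloading.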
Hiring [15:13]: Luma AI is actively hiring engineers, researchers, and AI enthusiasts.