Overview
Explore vLLM's industry-leading architecture and high-performance serving capabilities through a comprehensive conference talk that delivers insights for users at every experience level. Discover what makes vLLM exceptionally fast, flexible, and extensible through a detailed architectural walkthrough built around a real-world case study: integrating AITER (AI Tensor Engine for ROCm) kernels and Multi-head Latent Attention (MLA) to accelerate DeepSeek-R1 inference on AMD GPUs.

Learn how vLLM's internal components work, including CustomOps, attention mechanisms, and the integration points for high-performance kernels that enable high throughput and rapid incorporation of new model architectures and hardware backends. Understand the system's core architecture to build a strong mental model for optimizing production deployments, while gaining practical knowledge about where new logic should live, how the Python API bridges to custom kernels, and common pitfalls such as custom ops silently not being invoked without proper registration.

Master debugging workflows, including tracing CPU overhead and benchmarking practices, that help transform kernel prototypes into production-ready contributions, whether you're an MLOps engineer optimizing deployments or a kernel developer preparing to write the next major vLLM feature.
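The registration pitfall mentioned above is easiest to see in code. Below is a minimal, self-contained sketch of the backend-dispatch pattern the talk describes; the names (CustomOp, forward_native, forward_cuda, dispatch_forward, RMSNorm) mirror vLLM's style but are illustrative simplifications, not vLLM's actual implementation.

```python
# Illustrative sketch of the CustomOp dispatch pattern described in the talk.
# This is a simplified stand-in, not vLLM's real code (which lives in
# vllm/model_executor/custom_op.py and differs in detail).
import torch
import torch.nn as nn


class CustomOp(nn.Module):
    """Base class that picks a backend-specific forward at construction time."""

    def __init__(self):
        super().__init__()
        # Dispatch once, up front, so the hot path pays no branching cost.
        self._forward_method = self.dispatch_forward()

    def forward(self, *args, **kwargs):
        return self._forward_method(*args, **kwargs)

    def forward_native(self, *args, **kwargs):
        # Pure-PyTorch reference path; always available.
        raise NotImplementedError

    def forward_cuda(self, *args, **kwargs):
        # Subclasses override this to call a hand-written GPU kernel.
        # Pitfall from the talk: if an op does not route through this dispatch
        # mechanism, the custom kernel is silently never invoked and the slow
        # native path runs instead.
        return self.forward_native(*args, **kwargs)

    def dispatch_forward(self):
        return self.forward_cuda if torch.cuda.is_available() else self.forward_native


class RMSNorm(CustomOp):
    """Toy RMSNorm showing where a fast kernel would plug in."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(variance + self.eps) * self.weight

    # forward_cuda would call a custom kernel here, e.g. one from AITER.


if __name__ == "__main__":
    norm = RMSNorm(16)
    print(norm(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```

Resolving the backend method once in the constructor, rather than branching per call, is the kind of CPU-overhead discipline the talk's debugging and benchmarking section emphasizes.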
Syllabus
Embedded LLM’s Guide to vLLM Architecture & High-Performance Serving | Ray Summit 2025
Taught by
Anyscale