LLM Inference - A Comparative Guide to Modern Open-Source Runtimes
MLOps World: Machine Learning in Production via YouTube
Overview
Explore a comprehensive technical deep dive into deploying large language models at scale through this 52-minute conference talk from the MLOps World GenAI Summit 2025. Learn how the Wildberries AI team built and battle-tested a production-grade LLM serving platform using multiple open-source runtimes, including vLLM, Triton TensorRT-LLM, Text Generation Inference (TGI), and SGLang.

Discover their custom benchmarking setup, understand the trade-offs across the different runtimes, and gain insight into when each framework makes sense based on model size, latency targets, and workload patterns.

Examine practical implementation strategies, including Horizontal Pod Autoscaling (HPA) for vLLM, reducing cold-start times with Tensorize, co-locating multiple vLLM models per pod to optimize GPU memory usage, and implementing SAQ-based queue wrappers for fair and efficient request handling. Understand how to wrap endpoints with Kong for per-user rate limits, token quotas, and observability.

Get real-world insights from running DeepSeek R1-0528 in production while maintaining flexibility and controlling costs and complexity. Master the fundamentals of why there is no single best LLM serving stack, how to benchmark and deploy multiple runtimes effectively, the specific trade-offs between frameworks like vLLM, TGI, Triton, and SGLang, and how to design an LLM inference setup that fits your specific use-case requirements.
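The fair request handling mentioned above can be illustrated with a minimal round-robin queue that interleaves requests across users, so one heavy user cannot starve the rest. This is a hypothetical plain-Python sketch of the fairness idea only, not the team's actual SAQ-based production wrapper; the class and method names are invented for illustration.

```python
from collections import deque, defaultdict

class FairQueue:
    """Round-robin scheduling across users: each user gets one request
    serviced per cycle, regardless of how many they have queued.
    Illustrative sketch only -- names and structure are hypothetical."""

    def __init__(self):
        self._buckets = defaultdict(deque)  # user -> their pending requests
        self._order = deque()               # users with pending work, in turn order

    def submit(self, user, request):
        # A user enters the rotation only when going from empty to non-empty.
        if not self._buckets[user]:
            self._order.append(user)
        self._buckets[user].append(request)

    def next(self):
        # Pop the next user's oldest request; re-queue the user if more remain.
        if not self._order:
            return None
        user = self._order.popleft()
        request = self._buckets[user].popleft()
        if self._buckets[user]:
            self._order.append(user)
        return user, request
```

Even if one user submits many requests up front, other users' requests are interleaved rather than waiting behind the whole backlog.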
Syllabus
LLM Inference: A Comparative Guide to Modern Open-Source Runtimes | Aleksandr Shirokov, Wildberries
Taught by
MLOps World: Machine Learning in Production