Overview
Learn about NanoFlow, a novel serving framework designed to optimize Large Language Model (LLM) serving throughput, in this 16-minute conference presentation from OSDI '25. Discover how researchers from the University of Washington, Tsinghua University, the University of California, Berkeley, and the University of Michigan challenge the common assumption that LLM serving is memory-bound by demonstrating through detailed analysis that end-to-end LLM serving is actually compute-bound for most common workloads and LLMs.

Explore the key insight that existing serving engines fail to achieve optimal compute utilization because the heterogeneous operations comprising LLM serving—compute, memory, and networking—are executed sequentially within a device. Understand how NanoFlow exploits intra-device parallelism by overlapping the usage of heterogeneous resources within a single device: it splits inputs into smaller nano-batches and duplicates operations to run on each portion independently. Examine the automatic optimization process that identifies the optimal number, size, ordering, and GPU resource allocation of nano-batches to minimize execution time while accounting for interference between concurrent operations.

Review comprehensive evaluation results showing NanoFlow's performance on popular models including LLaMA-2-70B, Mixtral 8×7B, and LLaMA-3-8B, where the framework achieves a 1.91× throughput boost over state-of-the-art serving systems and reaches 50% to 72% of optimal throughput across popular models with practical workloads.
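The core idea described above—splitting a batch into nano-batches so that compute-bound and memory-bound operations overlap instead of running back to back—can be illustrated with a small, idealized sketch. This is not NanoFlow's actual implementation or API; the function names and the two-stage pipeline timing model below are illustrative assumptions only:

```python
def split_into_nano_batches(requests, n):
    """Split a batch of requests into n roughly equal nano-batches
    (an illustrative stand-in for NanoFlow's input splitting)."""
    size = -(-len(requests) // n)  # ceiling division
    return [requests[i:i + size] for i in range(0, len(requests), size)]

def two_stage_pipeline_time(compute_t, memory_t, n_nano):
    """Idealized total time when a compute-bound stage and a memory-bound
    stage overlap across n_nano nano-batches (classic pipeline model:
    first nano-batch traverses both stages, the rest are limited by the
    bottleneck stage)."""
    c, m = compute_t / n_nano, memory_t / n_nano
    return (c + m) + (n_nano - 1) * max(c, m)

# Sequential execution runs the full compute phase, then the full
# memory phase; overlapping nano-batches hides the shorter phase.
sequential = 10.0 + 10.0
overlapped = two_stage_pipeline_time(10.0, 10.0, 4)  # 5.0 + 3 * 2.5 = 12.5
```

With more nano-batches the overlapped time approaches max(compute_t, memory_t), which is why finer-grained splitting raises utilization of the otherwise-idle resource; the talk's optimizer additionally weighs the interference that concurrent operations impose on each other, which this toy model ignores.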
Syllabus
OSDI '25 - NanoFlow: Towards Optimal Large Language Model Serving Throughput
Taught by
USENIX