Overview
Learn about NanoFlow, a novel serving framework designed to optimize Large Language Model (LLM) serving throughput, in this 16-minute conference presentation from OSDI '25. Discover how researchers from the University of Washington, Tsinghua University, University of California Berkeley, and the University of Michigan challenge the common assumption that LLM serving is memory-bound, demonstrating through detailed analysis that end-to-end LLM serving is actually compute-bound for most common workloads and LLMs.

Explore the key insight that existing serving engines fail to achieve optimal compute utilization because the heterogeneous operations comprising LLM serving—compute, memory, and networking—are executed sequentially within a device. Understand how NanoFlow exploits intra-device parallelism by overlapping the usage of heterogeneous resources within a single device: inputs are split into smaller nano-batches, and operations are duplicated so that each portion is processed independently.

Examine the automatic optimization process that identifies the optimal number, size, ordering, and GPU resource allocation of nano-batches to minimize execution time while accounting for interference between concurrent operations. Review comprehensive evaluation results on popular models including LLaMA-2-70B, Mixtral 8×7B, and LLaMA-3-8B, where the framework achieves a 1.91× throughput boost over state-of-the-art serving systems and reaches 50% to 72% of optimal throughput with practical workloads.
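The nano-batching idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not NanoFlow's actual implementation, whose names, batch sizes, and pipeline stages are assumptions here): a batch of requests is partitioned into nano-batches, and a duplicated operation runs on each portion independently, which is what allows stages of different nano-batches to overlap on a device's compute, memory, and network resources.

```python
# Hypothetical sketch of nano-batch splitting; all names and sizes are
# illustrative and not taken from the NanoFlow codebase.

def split_into_nano_batches(batch, nano_batch_size):
    """Partition a batch of requests into smaller nano-batches."""
    return [batch[i:i + nano_batch_size]
            for i in range(0, len(batch), nano_batch_size)]

def process(nano_batch, stage):
    """Stand-in for a duplicated per-nano-batch operation."""
    return [f"{stage}:{req}" for req in nano_batch]

batch = [f"req{i}" for i in range(8)]
nano_batches = split_into_nano_batches(batch, 3)

# Each nano-batch flows through its stages independently, so stage k of
# nano-batch j can overlap with stage k+1 of nano-batch j-1 on different
# device resources.
results = [process(nb, "decode") for nb in nano_batches]
```

In a real serving engine this overlap would be realized with concurrent GPU streams rather than a Python loop; the sketch only shows the partitioning and duplication structure the talk describes.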
Syllabus
OSDI '25 - NanoFlow: Towards Optimal Large Language Model Serving Throughput
Taught by
USENIX