Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot

This conference talk presents Mooncake, the serving platform for the Kimi chatbot developed by Moonshot AI; the accompanying paper won the Best Paper award at USENIX FAST '25. Explore a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters and puts the otherwise underutilized CPU, DRAM, SSD, and NIC resources of the GPU cluster to work as a disaggregated KVCache. Learn how Mooncake's global cache and specialized scheduler maximize throughput while meeting strict latency requirements. In long-context scenarios, Mooncake increases effective request capacity by 59% to 498% compared to baseline methods. Discover how this architecture, which currently runs across thousands of nodes and processes over 100 billion tokens daily, enables Kimi to handle 115% and 107% more requests on NVIDIA A800 and H800 clusters, respectively, than previous systems.
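To make the "more storage for less computation" idea concrete, here is a minimal Python sketch of prefix-cache-aware prefill scheduling. It is an illustration under stated assumptions, not Mooncake's actual scheduler: the names `BLOCK_SIZE`, `block_hashes`, `PrefillNode`, and `pick_node` are hypothetical, the cost model simply counts tokens, and KVCache transfer between nodes is ignored.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KVCache block (illustrative value, chosen small for the demo)


def block_hashes(tokens: list[int]) -> list[str]:
    """Chain-hash each full block so that a hash identifies the entire prefix up to it."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for i in range(0, full, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        prev = hashlib.sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class PrefillNode:
    name: str
    queued_tokens: int = 0                          # prefill work already waiting on this node
    cached: set[str] = field(default_factory=set)   # hashes of KVCache blocks held by this node

    def reusable_blocks(self, hashes: list[str]) -> int:
        """Count leading blocks whose KVCache is already stored (the longest cached prefix)."""
        n = 0
        for h in hashes:
            if h not in self.cached:
                break
            n += 1
        return n


def pick_node(nodes: list[PrefillNode], tokens: list[int]) -> tuple[PrefillNode, int]:
    """Pick the node with the lowest estimated cost: queued work plus uncached tokens."""
    hashes = block_hashes(tokens)

    def tokens_to_compute(node: PrefillNode) -> int:
        return len(tokens) - node.reusable_blocks(hashes) * BLOCK_SIZE

    best = min(nodes, key=lambda n: n.queued_tokens + tokens_to_compute(n))
    work = tokens_to_compute(best)   # measure before the cache is updated
    best.queued_tokens += work
    best.cached.update(hashes)       # the node will hold these blocks after prefill
    return best, work


if __name__ == "__main__":
    nodes = [PrefillNode("a"), PrefillNode("b")]
    node, work = pick_node(nodes, list(range(8)))
    print(node.name, work)
    node.queued_tokens = 0  # pretend the first request has finished
    node, work = pick_node(nodes, list(range(8)) + [100, 101, 102, 103])
    print(node.name, work)
```

In this toy model, a follow-up request that shares a prefix with an earlier one is routed to the node that already stores that prefix, so only the new tokens need prefill computation. A production scheduler would also have to weigh cache transfer cost, decoding load, and latency targets, none of which is modeled here.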