Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot

USENIX via YouTube

Overview

This conference talk presents Mooncake, the serving platform for the Kimi chatbot developed by Moonshot AI; the accompanying paper won the Best Paper award at USENIX FAST '25. Explore a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters while making efficient use of the CPU, DRAM, SSD, and NIC resources in GPU clusters. Learn how Mooncake's global cache and specialized scheduler maximize throughput while meeting strict latency requirements. In long-context scenarios, Mooncake increases effective request capacity by 59% to 498% compared with baseline methods. The system currently processes over 100 billion tokens daily across thousands of nodes, and it enables Kimi to handle 115% more requests on NVIDIA A800 clusters and 107% more on H800 clusters than previous systems.

Syllabus

FAST '25 - Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture...

Taught by

USENIX
