

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

MIT HAN Lab via YouTube

Overview

This conference talk presents "QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving" at MLSys 2025, delivered at the Santa Clara Convention Center. Explore how researchers from MIT HAN Lab developed a novel approach combining 4-bit weight (W4) and 8-bit activation (A8) quantization with a 4-bit key-value cache (KV4) to improve large language model (LLM) serving efficiency. The 14-minute presentation by Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han details their system co-design, which addresses both model size and inference latency. Learn about their methodology, implementation, and performance results demonstrating significant improvements in LLM serving efficiency. Additional resources, including the project website, research paper, and code repository, are available for those interested in implementing or building on this work.
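
For readers unfamiliar with the W4A8KV4 notation: it denotes 4-bit weights, 8-bit activations, and a 4-bit key-value cache. The NumPy sketch below illustrates the basic arithmetic behind the W4A8 part of that scheme, using symmetric per-output-channel 4-bit weight quantization and per-tensor 8-bit activation quantization. It is a simplified illustration of the general idea only, not QServe's actual algorithm; the function and variable names are hypothetical.

    import numpy as np

    def quantize_symmetric(x, n_bits, axis=None):
        # Symmetric uniform quantization to signed n_bits integers.
        # Returns the integer codes and the scale needed to dequantize.
        qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
        absmax = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
        scale = np.where(absmax == 0, 1.0, absmax / qmax)  # avoid div-by-zero
        return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8), scale

    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 16)).astype(np.float32)  # weights (out, in)
    x = rng.standard_normal((4, 16)).astype(np.float32)  # activations (batch, in)

    # W4: per-output-channel 4-bit weights; A8: per-tensor 8-bit activations.
    Wq, w_scale = quantize_symmetric(W, n_bits=4, axis=1)
    xq, x_scale = quantize_symmetric(x, n_bits=8)

    # Integer matmul accumulated in int32, then rescaled back to floating point.
    y = (xq.astype(np.int32) @ Wq.astype(np.int32).T) * (x_scale * w_scale.T)
    print("max abs error vs. FP reference:", np.max(np.abs(y - x @ W.T)))

The speedups reported in the talk come from co-designing this numeric format with the GPU serving system, which a sketch like this does not capture; see the paper and code repository linked from the course page for the details.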

Syllabus

MLSys'25 - QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Taught by

MIT HAN Lab


