

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

MIT HAN Lab via YouTube

Overview

This conference talk presents "QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving" at MLSys 2025, delivered at the Santa Clara Convention Center. Explore how researchers from MIT HAN Lab developed a novel approach combining 4-bit weight (W4) and 8-bit activation (A8) quantization with a 4-bit key-value cache (KV4) to improve large language model (LLM) serving efficiency. The 14-minute presentation by Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han details their system co-design, which addresses both model size and inference latency. Learn about their methodology, implementation, and performance results demonstrating significant improvements in LLM serving efficiency. Additional resources, including the project website, research paper, and code repository, are available for those interested in implementing or building on this work.
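
For readers unfamiliar with the W4A8KV4 notation: it denotes 4-bit weights, 8-bit activations, and a 4-bit key-value cache. The NumPy sketch below illustrates the basic arithmetic behind the W4A8 part of that scheme, using symmetric per-output-channel 4-bit weight quantization and per-tensor 8-bit activation quantization. It is a simplified illustration of the general idea only, not QServe's actual algorithm; the function and variable names are hypothetical.

    import numpy as np

    def quantize_symmetric(x, n_bits, axis=None):
        # Symmetric uniform quantization to signed n_bits integers.
        # Returns the integer codes and the scale needed to dequantize.
        qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
        absmax = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
        scale = np.where(absmax == 0, 1.0, absmax / qmax)  # avoid div-by-zero
        return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8), scale

    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 16)).astype(np.float32)  # weights (out, in)
    x = rng.standard_normal((4, 16)).astype(np.float32)  # activations (batch, in)

    # W4: per-output-channel 4-bit weights; A8: per-tensor 8-bit activations.
    Wq, w_scale = quantize_symmetric(W, n_bits=4, axis=1)
    xq, x_scale = quantize_symmetric(x, n_bits=8)

    # Integer matmul accumulated in int32, then rescaled back to floating point.
    y = (xq.astype(np.int32) @ Wq.astype(np.int32).T) * (x_scale * w_scale.T)
    print("max abs error vs. FP reference:", np.max(np.abs(y - x @ W.T)))

The speedups reported in the talk come from co-designing this numeric format with the GPU serving system, which a sketch like this does not capture; see the paper and code repository linked from the course page for the details.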

Syllabus

MLSys'25 - QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Taught by

MIT HAN Lab


