
YouTube

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

MIT HAN Lab via YouTube

Overview

Explore a 12-minute conference talk from MLSys 2025, "LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention," delivered by researchers from MIT HAN Lab. Discover how this approach addresses the challenges of serving large language models with long input sequences, and learn about the unified sparse attention mechanism that makes long-context LLM deployment more efficient. The presentation, held on May 14th at the Santa Clara Convention Center in California, features insights from authors Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han. Access additional resources, including the project website, research paper, and GitHub repository, to further explore this work in machine learning systems.
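To make the title's key term concrete before watching: in sparse attention, each query attends to only a subset of the cached keys and values rather than all of them, which is what makes long-sequence serving cheaper. The short NumPy sketch below illustrates the generic block-sparse idea only; it is not the LServe implementation, and every name and parameter in it (block_sparse_attention, block_size, keep_blocks) is a hypothetical choice for illustration.

    # Minimal, illustrative sketch of block-sparse attention in NumPy.
    # NOT the LServe implementation; it only shows the general idea the talk
    # addresses: skipping whole key/value blocks so attention cost scales with
    # the number of *selected* blocks rather than the full sequence length.
    import numpy as np

    def block_sparse_attention(q, k, v, block_size=4, keep_blocks=None):
        """Attend a single query over only the selected KV blocks.

        q: (d,) query; k, v: (n, d) keys/values, n divisible by block_size.
        keep_blocks: indices of KV blocks to attend to (None = all blocks).
        """
        n, d = k.shape
        n_blocks = n // block_size
        if keep_blocks is None:
            keep_blocks = range(n_blocks)
        # Gather only the key/value rows belonging to the selected blocks.
        idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                              for b in keep_blocks])
        k_sel, v_sel = k[idx], v[idx]
        scores = k_sel @ q / np.sqrt(d)          # scaled dot products, shape (m,)
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        return weights @ v_sel                   # attention output, shape (d,)

    # Usage: attend over 2 of 4 blocks instead of all 16 cached positions.
    rng = np.random.default_rng(0)
    q = rng.standard_normal(8)
    k = rng.standard_normal((16, 8))
    v = rng.standard_normal((16, 8))
    out = block_sparse_attention(q, k, v, block_size=4, keep_blocks=[0, 3])
    print(out.shape)  # (8,)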

Syllabus

MLSys'25 - LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Taught by

MIT HAN Lab

