
YouTube

Scaling of Quantized Large Language Models for Efficient Inference

MLOps World: Machine Learning in Production via YouTube

Overview

Explore the intersection of network quantization and large language models in this 33-minute conference talk, which revisits decade-old quantization theory from a fresh perspective in the LLM era. Scaling laws reliably predict the model-quality returns on training compute, yet considerable uncertainty remains about how quality scales after post-training quantization for inference deployment. The talk examines the additional factors that govern LLM scaling once models are quantized and asks whether empirical scaling laws can shed light on the effectiveness of LLM quantization. It offers theoretical insight into the challenges and opportunities of network compression, drawing on recent research findings, and covers practical considerations for deploying quantized LLMs efficiently in production, with implications for AI accelerators and algorithm-hardware codesign.
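To make the topic concrete, here is a minimal sketch of symmetric per-tensor post-training quantization, the kind of technique whose scaling behavior the talk analyzes. This is an illustrative example using NumPy, not the speaker's method: the tensor shape and distribution are hypothetical, and production schemes typically use per-channel scales and calibration data.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor post-training quantization to int8.

    Returns the quantized integer tensor and its scale factor.
    """
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 values."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for an LLM layer (hypothetical size/init).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Quantization error: rounding contributes at most scale/2 per element.
mse = float(np.mean((w - w_hat) ** 2))
max_err = float(np.abs(w - w_hat).max())
```

Measuring how this reconstruction error translates into end-task quality loss as model size grows is exactly the kind of question an empirical scaling law for quantization tries to answer.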

Syllabus

Scaling of Quantized Large Language Models for Efficient Inference

Taught by

MLOps World: Machine Learning in Production

