MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
Scalable Parallel Computing Lab, SPCL @ ETH Zurich via YouTube
Overview
This conference talk presents MARLIN (Mixed-precision Auto-Regressive LINear kernels), a novel approach for efficient batched inference on quantized Large Language Models. Discover how the researchers from ETH Zurich's Scalable Parallel Computing Lab achieved substantial speedups for multi-user inference scenarios while maintaining the benefits of model weight quantization. Learn about the technical innovations that allow MARLIN to support batch sizes up to 16-32 with nearly maximum (4×) quantization speedup, and larger batch sizes with gradually decreasing but still significant acceleration. The presentation covers the combination of techniques including asynchronous memory access, complex task scheduling, pipelining, and specialized quantization support that enable these performance gains. See experimental results demonstrating how MARLIN's near-optimal performance on individual LLM layers translates to significant end-to-end inference speedups (up to 2.8×) when integrated with the vLLM serving engine, plus extensions to other compression techniques like NVIDIA 2:4 sparsity for additional performance improvements.
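To make the core operation concrete: the kernels discussed in the talk fuse dequantization of 4-bit weights with the matrix multiply itself, so weights move through memory at a quarter of their FP16 size. The following is a minimal NumPy sketch of group-wise symmetric 4-bit quantization and dequantize-then-multiply — an illustration of the idea, not MARLIN's actual GPU kernel (the group size and function names here are illustrative assumptions).

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Quantize weights to signed 4-bit integers with one scale per group.

    Illustrative sketch only; group_size=128 is a common choice for
    4-bit LLM quantization, not necessarily what the talk uses.
    """
    groups = w.reshape(-1, group_size)
    # int4 symmetric range is [-8, 7]; scale so the max maps to 7
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequant_matmul(x, q, scales, out_features):
    """Dequantize on the fly and multiply, as a fused kernel would."""
    w = (q.astype(np.float32) * scales).reshape(out_features, -1)
    return x @ w.T

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)   # FP16-style weights
x = rng.standard_normal((16, 256)).astype(np.float32)    # batch of 16 tokens
q, s = quantize_4bit(w)
y = dequant_matmul(x, q, s, out_features=256)

ref = x @ w.T
rel_err = np.abs(y - ref).max() / np.abs(ref).max()
print(f"relative error: {rel_err:.3f}")
```

Because autoregressive decoding at small batch sizes is memory-bandwidth-bound, cutting weight traffic by 4× is what makes the near-4× speedup attainable; MARLIN's contribution is keeping that gain as the batch grows, where the naive fused approach loses it.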
Syllabus
MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
Taught by
Scalable Parallel Computing Lab, SPCL @ ETH Zurich