Overview
Learn about QFactory, a compilation framework designed to accelerate quantized large language model serving, in this 14-minute conference presentation from USENIX ATC '25. Discover how researchers from Tsinghua University address the performance limitations of existing quantization methods, which rely on a static eager execution paradigm for weight dequantization. Explore the Qtile abstraction, which represents quantized tensors efficiently and allows a traditional tensor computation graph to be transformed into a Qtile-graph (QGraph). Understand how QFactory applies graph-level Qtile computation transformations to generate equivalent QGraphs, expanding the optimization search space, and then uses operator-level Qtile scheduling to identify the best memory loading strategies. Examine the experimental results, which show a 1.66× average performance improvement over existing systems and a 1.23× end-to-end generation speedup when QFactory is integrated into state-of-the-art large language model serving systems. The talk is aimed at researchers and practitioners working on large language model optimization and quantization techniques.
Syllabus
USENIX ATC '25 - QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs
Taught by
USENIX