QFactory - Accelerating Quantized Large Language Model Serving with Qtile Graphs

USENIX via YouTube

Overview

This 14-minute conference presentation from USENIX ATC '25 introduces QFactory, a compilation framework designed to accelerate quantized large language model (LLM) serving. Researchers from Tsinghua University explain how existing quantization methods are held back by a static, eager execution paradigm for weight dequantization operations. The talk presents the Qtile abstraction, which represents quantized tensors efficiently and transforms traditional tensor computation graphs into Qtile-graphs (QGraphs). QFactory first applies graph-level Qtile computation transformations to generate equivalent QGraphs, which expands the optimization search space, and then applies operator-level Qtile scheduling to identify optimal memory loading strategies. The experimental results show an average 1.66× performance improvement over existing systems and a 1.23× end-to-end generation speedup when QFactory is integrated into state-of-the-art LLM serving systems. The talk is aimed at researchers and practitioners working on LLM optimization and quantization techniques.
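For readers new to the topic, the following is a minimal sketch of what "eager" weight dequantization means, written in plain Python. It is not QFactory's implementation, and it does not model Qtiles or QGraphs, whose details are not given in this listing. The function names, the group-wise scale and zero-point layout, and the toy data are all assumptions chosen for illustration. The first function materializes the full floating-point weight matrix before multiplying; the second dequantizes one small group at a time while accumulating, which hints at the kind of alternative schedule a compiler could consider.

```python
def dequantize_group(codes, scale, zero):
    """Map integer codes back to floats: w = scale * (q - zero)."""
    return [scale * (q - zero) for q in codes]


def eager_matvec(qweight, scales, zeros, x, group):
    """Eager style: dequantize the whole weight matrix, then multiply."""
    weight = []
    for r, row in enumerate(qweight):
        full_row = []
        for g in range(0, len(row), group):
            gi = g // group
            full_row.extend(
                dequantize_group(row[g:g + group], scales[r][gi], zeros[r][gi])
            )
        weight.append(full_row)  # full float matrix is held in memory
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def tiled_matvec(qweight, scales, zeros, x, group):
    """Tile style: dequantize one group at a time and accumulate at once."""
    out = []
    for r, row in enumerate(qweight):
        acc = 0.0
        for g in range(0, len(row), group):
            gi = g // group
            tile = dequantize_group(row[g:g + group], scales[r][gi], zeros[r][gi])
            acc += sum(w * xi for w, xi in zip(tile, x[g:g + group]))
        out.append(acc)
    return out


# Hypothetical toy data: 2x4 weight matrix of 4-bit codes, group size 2.
qweight = [[0, 15, 7, 8], [3, 3, 12, 1]]
scales = [[0.5, 0.25], [1.0, 0.5]]
zeros = [[8, 8], [8, 8]]
x = [1.0, 2.0, -1.0, 0.5]

print(eager_matvec(qweight, scales, zeros, x, group=2))
print(tiled_matvec(qweight, scales, zeros, x, group=2))
```

Both functions compute the same product; they differ only in when dequantized values are produced and how many of them exist at once, which is the kind of trade-off a scheduling search would explore.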

Syllabus

USENIX ATC '25 - QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs

Taught by

USENIX
