QFactory - Accelerating Quantized Large Language Model Serving with Qtile Graphs

Learn about QFactory, a compilation framework designed to accelerate quantized large language model serving, in this 14-minute conference presentation from USENIX ATC '25. Discover how researchers from Tsinghua University address the performance limitations of existing quantization methods, which rely on a static, eager execution paradigm for weight dequantization. Explore the Qtile abstraction, which represents quantized tensors efficiently and allows traditional tensor computation graphs to be transformed into Qtile graphs (Qgraphs). Understand how QFactory first applies graph-level Qtile computation transformations to generate equivalent Qgraphs, expanding the optimization search space, and then applies operator-level Qtile scheduling to identify optimal memory loading strategies. Examine the experimental results: a 1.66× average performance improvement over existing systems and a 1.23× end-to-end generation speedup when QFactory is integrated into state-of-the-art large language model serving systems. The talk is aimed at researchers and practitioners working on large language model optimization and quantization techniques.
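To make the contrast in the description concrete, the following is a minimal sketch of eager versus tile-wise weight dequantization. It is not QFactory's implementation or API. The function names (`quantize_rows`, `matvec_eager`, `matvec_tiled`), the symmetric per-row 4-bit scheme, and the tile size are all assumptions chosen for illustration.

```python
# Illustrative sketch only: every name and the quantization scheme here are
# assumptions for this example, not taken from QFactory.

def quantize_rows(weights, bits=4):
    """Symmetric per-row quantization: returns (integer codes, per-row scales)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes
    codes, scales = [], []
    for row in weights:
        scale = max(abs(v) for v in row) / qmax or 1.0  # avoid a zero scale
        codes.append([round(v / scale) for v in row])
        scales.append(scale)
    return codes, scales


def matvec_eager(codes, scales, x):
    """Eager path: dequantize the whole matrix first, then multiply."""
    dense = [[c * s for c in row] for row, s in zip(codes, scales)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in dense]


def matvec_tiled(codes, scales, x, tile=2):
    """Tile-wise path: dequantize one column tile at a time and accumulate."""
    out = [0.0] * len(codes)
    for start in range(0, len(x), tile):
        x_tile = x[start:start + tile]
        for r, (row, s) in enumerate(zip(codes, scales)):
            w_tile = [c * s for c in row[start:start + tile]]
            out[r] += sum(w * xi for w, xi in zip(w_tile, x_tile))
    return out


weights = [[0.5, -1.0, 0.25, 0.75], [2.0, 0.0, -2.0, 1.0]]
x = [1.0, 2.0, 3.0, 4.0]
codes, scales = quantize_rows(weights)
eager = matvec_eager(codes, scales, x)
tiled = matvec_tiled(codes, scales, x)
```

Both paths compute the same product up to floating-point rounding. The tiled version never materializes the full dequantized matrix, which is the kind of flexibility a tile-level representation exposes to a compiler.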

Syllabus

USENIX ATC '25 - QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs

Taught by

USENIX

Quant-LLM: Accelerating Large Language Model Serving via FP6-Centric Algorithm-System Co-Design

DEEPSERVE - Serverless Large Language Model Serving at Scale

Accelerating the Training of Large Language Models - Efficient Activation Rematerialization and Optimal Hybrid Parallelism

GeneralSparse - Bridging the Gap in SpMM for Pruned Large Language Model Inference on GPUs

Scaling of Quantized Large Language Models for Efficient Inference
