Chameleon - Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
Scalable Parallel Computing Lab, SPCL @ ETH Zurich via YouTube
Overview
Explore a research presentation on Chameleon, a novel heterogeneous accelerator system designed to optimize Retrieval-Augmented Language Models (RALMs) through innovative hardware architecture. Learn how this system combines large language models with vector databases to achieve context-specific knowledge retrieval during text generation, enabling impressive generation quality with smaller models while reducing computational demands by orders of magnitude. Discover the key principles behind Chameleon's disaggregated architecture that integrates both LLM and vector search accelerators, allowing independent scaling to meet diverse RALM requirements. Examine the prototype implementation that utilizes FPGAs for vector search acceleration, GPUs for LLM inference, and CPUs as cluster coordinators. Understand the performance benefits demonstrated through comprehensive evaluation, including up to 2.16× reduction in latency and 3.18× speedup in throughput compared to traditional hybrid CPU-GPU architectures. Gain insights into how heterogeneous accelerator systems can revolutionize both LLM inference and vector search capabilities in future RALM deployments, presented by Wenqi Jiang from the Scalable Parallel Computing Lab at ETH Zurich based on research published in the Proceedings of the VLDB Endowment.
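The retrieval-augmented flow described above (a vector search over a knowledge base feeding context into an LLM) can be sketched as a toy pipeline. This is a minimal illustrative sketch, not Chameleon's actual API: the in-memory index, `retrieve`, and `generate` functions are all hypothetical stand-ins for what the paper maps onto FPGA and GPU accelerators respectively.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=1):
    # Vector search stage: rank stored (vector, passage) pairs by similarity.
    # In Chameleon, this stage is served by FPGA-based vector search accelerators.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [passage for _, passage in ranked[:k]]

def generate(prompt, passages):
    # LLM inference stub: a real deployment would run a GPU-hosted model here,
    # conditioning generation on the retrieved passages.
    context = " ".join(passages)
    return f"Answer based on: {context} | Question: {prompt}"

# Toy knowledge base of (embedding, passage) pairs.
index = [
    ([1.0, 0.0], "Zurich is in Switzerland."),
    ([0.0, 1.0], "FPGAs accelerate vector search."),
]
query_vec = [0.9, 0.1]  # stubbed embedding of the user prompt
passages = retrieve(query_vec, index, k=1)
print(generate("Where is Zurich?", passages))
```

Disaggregating the two stages, as the talk argues, lets the retrieval tier and the inference tier scale independently, since their compute and memory demands differ sharply across RALM workloads.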
Syllabus
Chameleon: Heterogeneous & Disaggregated Accelerator System for Retrieval-Augmented Language Models
Taught by
Scalable Parallel Computing Lab, SPCL @ ETH Zurich