Chameleon - Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
Scalable Parallel Computing Lab, SPCL @ ETH Zurich via YouTube
Overview
Explore a research presentation on Chameleon, a novel heterogeneous accelerator system designed to optimize Retrieval-Augmented Language Models (RALMs) through innovative hardware architecture. Learn how this system combines large language models with vector databases to achieve context-specific knowledge retrieval during text generation, enabling impressive generation quality with smaller models while reducing computational demands by orders of magnitude. Discover the key principles behind Chameleon's disaggregated architecture that integrates both LLM and vector search accelerators, allowing independent scaling to meet diverse RALM requirements. Examine the prototype implementation that utilizes FPGAs for vector search acceleration, GPUs for LLM inference, and CPUs as cluster coordinators. Understand the performance benefits demonstrated through comprehensive evaluation, including up to 2.16× reduction in latency and 3.18× speedup in throughput compared to traditional hybrid CPU-GPU architectures. Gain insights into how heterogeneous accelerator systems can revolutionize both LLM inference and vector search capabilities in future RALM deployments, presented by Wenqi Jiang from the Scalable Parallel Computing Lab at ETH Zurich based on research published in the Proceedings of the VLDB Endowment.
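The retrieve-then-generate flow that RALMs rely on — vector search over a database for context, followed by LLM inference — can be sketched in plain Python. This is purely illustrative: the database, embeddings, and function names below are invented for the example, and the brute-force scan stands in for the FPGA-accelerated vector search and the string concatenation stands in for GPU-based LLM generation described in the talk.

```python
import math

# Toy in-memory "vector database" of (embedding, passage) pairs.
# In the Chameleon prototype this search runs on FPGA accelerators;
# here a brute-force scan illustrates the same logical step.
DATABASE = [
    ([1.0, 0.0, 0.0], "Zurich is a city in Switzerland."),
    ([0.0, 1.0, 0.0], "FPGAs are reconfigurable hardware devices."),
    ([0.0, 0.0, 1.0], "Vector search finds nearest neighbours."),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    """Return the k passages whose embeddings are closest to the query."""
    ranked = sorted(DATABASE, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [passage for _, passage in ranked[:k]]

def generate(query_vec, prompt):
    """Retrieval-augmented generation: fetch context, then 'generate'.
    A real system would pass the augmented prompt to an LLM on a GPU."""
    context = retrieve(query_vec, k=1)[0]
    return f"[context: {context}] {prompt}"
```

Because retrieval and generation are separate stages with different hardware profiles, a disaggregated design like Chameleon's can scale the two accelerator pools independently to match a workload's mix of search and inference demands.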
Syllabus
Chameleon: Heterogeneous & Disaggregated Accelerator System for Retrieval-Augmented Language Models
Taught by
Scalable Parallel Computing Lab, SPCL @ ETH Zurich