Chameleon - Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
Scalable Parallel Computing Lab, SPCL @ ETH Zurich via YouTube
Overview
Explore a research presentation on Chameleon, a heterogeneous accelerator system designed for Retrieval-Augmented Language Models (RALMs). A RALM pairs a large language model (LLM) with a vector database and retrieves context-specific knowledge during text generation, which lets a smaller model reach high generation quality while reducing computational demands by orders of magnitude. Learn the key principles behind Chameleon's disaggregated architecture, which integrates both LLM and vector search accelerators and lets each be scaled independently to meet diverse RALM requirements. Examine the prototype implementation, which uses FPGAs for vector search acceleration, GPUs for LLM inference, and CPUs as cluster coordinators. Review the evaluation results, which show up to 2.16× reduction in latency and 3.18× speedup in throughput compared to traditional hybrid CPU-GPU architectures. Gain insights into how heterogeneous accelerator systems can speed up both LLM inference and vector search in future RALM deployments. The talk is presented by Wenqi Jiang from the Scalable Parallel Computing Lab at ETH Zurich and is based on research published in the Proceedings of the VLDB Endowment.
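As a rough illustration of the division of labor described above, the following Python sketch models a coordinator loop that alternates between a vector search node and an LLM node during generation. It is a toy under stated assumptions, not Chameleon's implementation: the class names, the `generate` function, the `retrieval_interval` parameter, and the trivial "model" that copies the retrieved payload are all hypothetical, and everything runs in one process rather than on FPGAs and GPUs.

```python
import math


class VectorSearchNode:
    """Hypothetical stand-in for a vector search accelerator."""

    def __init__(self, vectors, payloads):
        self.vectors = [tuple(v) for v in vectors]
        self.payloads = list(payloads)

    def search(self, query, k=1):
        # Brute-force nearest-neighbor search by Euclidean distance.
        order = sorted(
            range(len(self.vectors)),
            key=lambda i: math.dist(self.vectors[i], query),
        )
        return [self.payloads[i] for i in order[:k]]


class LLMNode:
    """Hypothetical stand-in for an LLM inference accelerator."""

    def encode_query(self, tokens):
        # Toy encoder: map the last token to a 2-D query vector.
        last = float(tokens[-1])
        return (last, last)

    def next_token(self, tokens, retrieved):
        # Toy decoder: emit the first retrieved payload, else repeat the last token.
        return retrieved[0] if retrieved else tokens[-1]


def generate(prompt, llm, retriever, steps, retrieval_interval=1):
    """Coordinator loop: retrieve every `retrieval_interval` steps, then decode."""
    tokens = list(prompt)
    retrieved = []
    for step in range(steps):
        if step % retrieval_interval == 0:
            retrieved = retriever.search(llm.encode_query(tokens), k=1)
        tokens.append(llm.next_token(tokens, retrieved))
    return tokens


retriever = VectorSearchNode(
    vectors=[(0, 0), (1, 1), (2, 2)],
    payloads=[1, 2, 0],
)
llm = LLMNode()
print(generate([0], llm, retriever, steps=3))
```

Because the two node classes only meet inside `generate`, either one could be replaced or replicated without touching the other, which is the property the disaggregated design relies on for independent scaling.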
Syllabus
Chameleon: Heterogeneous & Disaggregated Accelerator System for Retrieval-Augmented Language Models
Taught by
Scalable Parallel Computing Lab, SPCL @ ETH Zurich