Overview
Explore memory optimization strategies for Retrieval-Augmented Generation (RAG) pipelines using Compute Express Link (CXL) technology in this 15-minute conference talk. Learn how the stages of a RAG AI-inference pipeline consume large volumes of data: the data-preparation phase, which creates embeddings and inserts them into vector databases, requires significant transient memory, and the search phase adds memory consumption that grows with index tree sizes and the number of parallel queries. Peak memory usage therefore varies with pipeline load, including insertions and other transient behaviors, so statically provisioning local memory to cover the peak is inefficient. The talk examines two CXL memory approaches that address these peak-memory demands while reducing the cost of locally attached memory: CXL memory pooling, which provisions memory on demand for transient needs, and CXL memory tiering, which uses cheaper, larger-capacity memory.
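To make the peak-versus-steady-state argument concrete, here is a minimal back-of-the-envelope sketch (not from the talk; all names, batch sizes, and overhead factors are illustrative assumptions) that estimates how much transient memory an embedding-insert phase adds on top of the resident index. The gap between peak and steady state is the portion that could be served from a CXL memory pool instead of statically provisioned local DRAM:

```python
def insert_phase_memory_gib(num_docs, dim, batch_size,
                            bytes_per_float=4, index_overhead=1.5):
    """Return (steady_gib, peak_gib) for inserting embeddings into a
    vector index. All parameters are hypothetical workload assumptions."""
    vec_bytes = dim * bytes_per_float
    # Steady state: the index itself (vectors plus index-structure overhead).
    steady = num_docs * vec_bytes * index_overhead
    # Transient: one in-flight batch of raw embeddings held before insertion.
    transient = batch_size * vec_bytes
    gib = 1024 ** 3
    return steady / gib, (steady + transient) / gib

# Example workload: 50M documents, 768-dim float32 embeddings,
# inserted in batches of 2M vectors.
steady, peak = insert_phase_memory_gib(num_docs=50_000_000, dim=768,
                                       batch_size=2_000_000)
print(f"steady ~ {steady:.1f} GiB, peak ~ {peak:.1f} GiB")
# The peak-minus-steady gap is the transient demand a CXL pool could absorb.
```

Under these assumed numbers the transient spike is only a few GiB on top of a roughly 200 GiB resident index, yet every node sized for the peak carries that extra DRAM permanently, which is the inefficiency the pooling approach targets.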
Syllabus
Towards memory efficient RAG pipelines with CXL technology
Taught by
Open Compute Project