Overview
Explore advanced memory optimization techniques for RAG (Retrieval-Augmented Generation) pipelines in AI inference systems through this 34-minute conference presentation from SNIA SDC 2025. Learn how CXL (Compute Express Link) technology addresses the significant memory challenges that arise across the stages of a RAG pipeline, including vector embedding creation, vector DB insertion, and search operations, all of which consume substantial transient memory.

Discover two key CXL-based approaches: memory pooling, which dynamically provisions capacity to match transient demand, and memory tiering, which uses cheaper, larger-capacity memory to reduce the cost of locally attached DRAM. Examine the current state of open-source infrastructure supporting these solutions and understand how they achieve significant DRAM cost savings with minimal performance trade-offs. Gain insight into the typical memory requirements of vector DB use cases across AI inference stages, and see how CXL-based methodologies can improve DRAM Total Cost of Ownership (TCO). Understand the open-source software infrastructure needed to implement CXL memory pooling and tiering, along with potential ideas to bridge existing gaps in the technology stack.

Presented by Arun George and Roshan Nair from Samsung Semiconductor India Research, this technical presentation provides practical solutions for optimizing memory efficiency in modern AI inference workloads.
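As background for the tiering idea discussed in the talk: on Linux, a CXL memory expander is typically exposed as a CPU-less NUMA node, and software can place capacity-bound data there while keeping latency-sensitive state in local DRAM. The sketch below is illustrative only, not the presenters' implementation; the node IDs and buffer sizes are assumptions, and it uses only standard libnuma calls.

```c
/*
 * Minimal tiering sketch: put a large, infrequently accessed buffer
 * (e.g., bulk vector-index data) on an assumed CXL-backed NUMA node,
 * and hot query state on local DRAM.
 * Assumptions: node 0 = local DRAM, node 1 = CXL expander.
 * Build with: gcc tier_demo.c -o tier_demo -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define LOCAL_NODE 0  /* assumed: locally attached DRAM */
#define CXL_NODE   1  /* assumed: CXL memory as a CPU-less NUMA node */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t cold_bytes = 1UL << 30;   /* 1 GiB of capacity-bound data */
    size_t hot_bytes  = 256UL << 20; /* 256 MiB of latency-sensitive state */

    /* Capacity-bound data goes to the cheaper CXL tier... */
    void *cold = numa_alloc_onnode(cold_bytes, CXL_NODE);
    /* ...while hot state stays in local DRAM. */
    void *hot = numa_alloc_onnode(hot_bytes, LOCAL_NODE);
    if (!cold || !hot) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touch the pages so they are actually faulted in on the chosen nodes. */
    memset(cold, 0, cold_bytes);
    memset(hot, 0, hot_bytes);

    numa_free(cold, cold_bytes);
    numa_free(hot, hot_bytes);
    return 0;
}
```

This shows only explicit, application-directed placement; kernel-level memory tiering, which can demote cold pages to such a node automatically, is a separate mechanism and part of the open-source infrastructure the presentation surveys.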
Syllabus
SNIA SDC 2025 - Towards Memory Efficient RAG Pipelines with CXL Technology
Taught by
SNIAVideo