Overview
Explore memory optimization strategies for Retrieval-Augmented Generation (RAG) pipelines using Compute Express Link (CXL) technology in this 15-minute conference talk. Learn how the stages of a RAG AI-inference pipeline consume large volumes of data, particularly the data preparation phase, where creating embeddings and inserting them into vector databases requires significant transient memory. Discover how the search phase adds to memory consumption depending on index tree sizes and the number of parallel queries, so that peak memory usage varies with pipeline load, including insertions and other transient behavior. Understand why statically provisioning local memory to meet peak usage is inefficient, and examine two proposed CXL approaches that address these memory demands while reducing the cost of locally attached memory: CXL Memory Pooling, which provisions memory based on transient needs, and CXL Memory Tiering, which uses cheaper, larger-capacity memory.
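The provisioning argument above can be illustrated with a small back-of-the-envelope sketch. It is not taken from the talk: the host names, the demand figures, and the `LOCAL_GIB` baseline are all hypothetical, chosen only to show why sizing every host for its own peak costs more than giving each host a modest local allocation plus a shared pool sized for the worst combined overflow.

```python
# Hypothetical per-host memory demand (GiB) over six time steps.
# Each host has one transient spike, e.g. during an embedding-insertion burst.
demand = {
    "host_a": [40, 42, 96, 44, 41, 40],
    "host_b": [38, 40, 41, 90, 42, 39],
    "host_c": [41, 39, 40, 42, 88, 40],
}

# Assumed local (directly attached) memory per host in the pooled setup.
LOCAL_GIB = 48


def static_provisioning(demand):
    """Total memory if every host is sized for its own peak demand."""
    return sum(max(series) for series in demand.values())


def pooled_provisioning(demand, local_gib):
    """Total memory with a fixed local allocation plus one shared pool.

    The pool is sized for the largest combined overflow (demand beyond
    local memory) seen at any single time step.
    """
    steps = len(next(iter(demand.values())))
    overflow_per_step = [
        sum(max(0, series[t] - local_gib) for series in demand.values())
        for t in range(steps)
    ]
    pool_gib = max(overflow_per_step)
    return local_gib * len(demand) + pool_gib, pool_gib


static_total = static_provisioning(demand)
pooled_total, pool_size = pooled_provisioning(demand, LOCAL_GIB)

print(f"static: {static_total} GiB")
print(f"pooled: {pooled_total} GiB (shared pool: {pool_size} GiB)")
```

Because the spikes in this made-up trace do not coincide, the shared pool only has to cover one spike at a time. If the spikes overlapped, the pool would need to grow and the saving would shrink.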
Syllabus
Towards memory efficient RAG pipelines with CXL technology
Taught by
Open Compute Project