Overview
Explore advanced techniques for optimizing AI inference with Retrieval-Augmented Generation (RAG) in this 35-minute conference talk from SNIA SDC 2025. Learn how enterprises can overcome infrastructure limitations and cost barriers when deploying complex AI models with large RAG datasets by leveraging open-source software components and high-performance NVMe SSDs. Discover two complementary approaches for achieving greater scale: offloading model weights to storage with DeepSpeed and offloading RAG vector data to storage with DiskANN. Examine how combining these methods allows models that previously would not fit in GPU memory to run on available hardware while improving cost efficiency for large RAG datasets. Analyze benchmarking results demonstrating the impact of SSD offload on DRAM usage, queries per second (QPS), index build time, and recall. Review a practical demonstration of the solution in a real-world traffic-video use case, and understand the broader opportunities and challenges of AI inference with RAG.
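As context for the weight-offload approach the talk describes, the sketch below shows a minimal DeepSpeed ZeRO stage 3 configuration that offloads model parameters to NVMe storage instead of keeping them in GPU or host memory. This is an illustrative fragment, not the speakers' exact configuration; the `nvme_path` value is a placeholder you would point at a local NVMe mount.

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    }
  },
  "train_micro_batch_size_per_gpu": 1
}
```

Passing a config like this to `deepspeed.initialize` lets ZeRO-Infinity page parameter shards in from SSD on demand, which is the mechanism that allows models larger than GPU memory to run at all; the trade-off, as the talk's benchmarks explore, is the added I/O latency per query.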
Syllabus
SNIA SDC 2025 - Data-Intensive Inference Done Better
Taught by
SNIAVideo