

The Truth About RAG and vLLM - Why Your Multimodal System Fails at Scale

InfoQ via YouTube

Overview

Discover the critical optimizations needed to build production-ready Retrieval-Augmented Generation (RAG) and self-hosted multimodal systems in this 47-minute technical conference talk. Learn why most RAG implementations fail at scale and master the essential decisions around embedding models, vector indexing, and inference optimization that separate successful deployments from failed experiments.

Explore the fundamentals of vector search and the tradeoffs between indexing approaches, including the FLAT, IVF, and HNSW algorithms. Examine the speed-versus-accuracy-versus-cost matrix that governs index selection, and see why choosing the wrong embedding model can doom a project from the start. Dive into the complete RAG pipeline, from unstructured data processing to retrieval, and understand why proper evaluation frameworks are essential rather than intuition-based development. Compare RAG against long-context LLM alternatives using real benchmarks from Llama 4 and Gemini 2.5, and implement hybrid search strategies that combine BM25 with similarity search and metadata filtering.

Build a complete self-hosted multimodal RAG stack using Pixtral, Milvus, and vLLM while addressing the critical inference challenges of latency, throughput, and batching. Master advanced optimization techniques including model parallelism via tensor parallelism, quantization strategies, and paged attention with KV cache management. Follow along with a live demonstration of the complete architecture and setup, and participate in a Q&A covering chunking strategies, CAG (cache-augmented generation), and other advanced topics essential for senior engineers and architects working with large-scale AI systems.
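To make the index tradeoff concrete, here is a minimal sketch of what a FLAT (exact, brute-force) search does; this is an illustrative NumPy toy, not the Milvus API. Its per-query cost is O(n·d) over all stored vectors, which is exactly why approximate indexes such as IVF and HNSW trade a little recall for sub-linear search time at scale.

```python
import numpy as np

def flat_search(index_vectors: np.ndarray, query: np.ndarray, top_k: int = 3):
    """Exact nearest-neighbor search by cosine similarity over every vector.

    Illustrative only: a FLAT index scans all n vectors per query, so it is
    perfectly accurate but scales linearly with collection size.
    """
    # Normalize rows so a dot product equals cosine similarity.
    idx = index_vectors / np.linalg.norm(index_vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = idx @ q                    # similarity against all n vectors
    top = np.argsort(-scores)[:top_k]   # highest-similarity ids first
    return list(top), scores[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))             # toy "embedding" store
query = vectors[42] + 0.01 * rng.normal(size=64)  # near-duplicate of id 42
ids, sims = flat_search(vectors, query)           # ids[0] should be 42
```

An IVF index would first cluster `vectors` and scan only a few clusters per query; HNSW would walk a layered proximity graph instead, which is where the speed/accuracy/cost matrix from the talk comes in.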

Syllabus

0:00 - Introduction & Multimodal RAG with Pixtral & vLLM
1:45 - What Vector Search Actually Is and Why It Matters
4:10 - Indexing Deep Dive: FLAT, IVF, and HNSW Explained
7:50 - The Index Tradeoff Matrix: Speed vs. Accuracy vs. Cost
9:45 - Embedding Models: Stop Using the Wrong Ones!
13:30 - The RAG Pipeline: From Unstructured Data to Retrieval
14:20 - Why Vibe Coding Your RAG Is a Disaster (Proper Evals)
16:45 - RAG vs. the Long-Context LLM Myth (Llama 4, Gemini 2.5 Benchmarks)
19:10 - Hybrid Search: BM25 + Similarity and Metadata Filtering
22:30 - Building the Self-Hosted Multimodal RAG Stack (Pixtral, Milvus, vLLM)
25:05 - The Inference Challenge: Latency, Throughput, and Batching
27:15 - Model Parallelism: Why You Need to Split the Model (Tensor Parallelism)
29:30 - Optimization Secrets: Quantization & Paged Attention (KV Cache)
31:50 - Live Demo Architecture & Setup
33:05 - Q&A: Chunking Strategy, CAG, and More
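The hybrid-search segment above pairs lexical BM25 with dense similarity. A toy sketch of that fusion idea is below; the corpus, weights, and function names are illustrative, not the speaker's implementation, and the dense scores stand in for precomputed embedding similarities.

```python
import math
from collections import Counter

DOCS = [
    "milvus is a vector database for similarity search",
    "vllm serves large language models with paged attention",
    "bm25 is a classic keyword ranking function",
]

def bm25_score(query: str, doc: str, corpus: list[str],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Simplified BM25: IDF-weighted term frequency with saturation (k1)
    and document-length normalization (b)."""
    words = doc.split()
    tf = Counter(words)
    avg_len = sum(len(d.split()) for d in corpus) / len(corpus)
    score = 0.0
    for term in query.split():
        df = sum(term in d.split() for d in corpus)
        if df == 0:
            continue
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        t = tf[term]
        score += idf * t * (k1 + 1) / (t + k1 * (1 - b + b * len(words) / avg_len))
    return score

def hybrid_rank(query: str, dense_scores: list[float], alpha: float = 0.5) -> int:
    """Weighted fusion of a lexical score and a dense-similarity score;
    returns the index of the best document."""
    lex = [bm25_score(query, d, DOCS) for d in DOCS]
    m = max(lex) or 1.0  # normalize lexical scores to [0, 1]
    fused = [alpha * (l / m) + (1 - alpha) * s for l, s in zip(lex, dense_scores)]
    return max(range(len(DOCS)), key=fused.__getitem__)

best = hybrid_rank("paged attention", dense_scores=[0.1, 0.9, 0.2])
```

In a real stack, metadata filtering would narrow the candidate set before either scorer runs, and the fusion weight (here `alpha`) is something you would tune against an evaluation set rather than by feel, echoing the talk's point about proper evals.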

Taught by

InfoQ

