Optimized RAG - Strategies for Cost and Scale

Learn practical strategies for optimizing Retrieval-Augmented Generation (RAG) systems for production in this conference talk and hands-on coding lab. The session covers the transition from prototype to production, focusing on reducing latency and cost while maintaining performance at scale. It walks through high-impact optimization techniques at each stage of the RAG pipeline: data preparation, retrieval and ranking, and generation with observability.

Topics include embedding quantization to reduce memory footprint and compute costs, context highlighting to improve relevance and reduce latency, Reciprocal Rank Fusion (RRF) for ranking in low-latency use cases, and context compression methods.

In the hands-on lab, you will use Python, Google Colab, Elasticsearch, and Hugging Face models to implement filtered search, embedding quantization, and context highlighting in real-world workflows. The speaker is a senior data scientist at Elastic who specializes in GenAI-powered search solutions and has extensive experience developing AI solutions for internet-scale platforms and for the metals and mining, oil and gas, and e-commerce domains.
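
As a rough illustration of two of the techniques named above, the following dependency-free Python sketch shows the general idea behind Reciprocal Rank Fusion and int8 embedding quantization. It is not taken from the talk or the lab, and it is not how Elasticsearch or any Hugging Face library implements these features. The function names, the constant k = 60 (a commonly used value), and the symmetric max-abs scaling scheme are all assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def quantize_int8(vector):
    """Symmetric scalar quantization of a float vector into [-127, 127].

    Returns the integer codes and the scale needed to approximately
    reconstruct the original values (assumed scheme: max-abs scaling).
    """
    max_abs = max(abs(x) for x in vector) or 1.0  # avoid dividing by zero
    scale = max_abs / 127.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original float vector."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Hypothetical result lists, e.g. one lexical and one vector retriever.
    lexical = ["a", "b", "c"]
    semantic = ["b", "c", "a"]
    print(reciprocal_rank_fusion([lexical, semantic]))

    codes, scale = quantize_int8([0.5, -1.0, 0.25])
    print(codes, dequantize(codes, scale))
```

Rank-based fusion needs only the position of each document in each list, so it avoids normalizing scores across retrievers. Storing one byte per dimension instead of a 4-byte float reduces embedding memory by roughly a factor of four, at the cost of a small reconstruction error.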