LLMs in the Wild - Streaming, RAG, and Real-Time GenAI at Scale

Explore the practical challenges of deploying and scaling Generative AI systems in production using Scala, FS2, and Server-Sent Events. Learn how to architect real-time GenAI applications that serve thousands of users with low latency while using Retrieval-Augmented Generation (RAG) to ground Large Language Models in dynamic business data. Discover techniques for streaming output token by token, orchestrating document retrieval pipelines on the fly, and managing critical production concerns, including memory pressure, backpressure, observability, error handling, and model fallbacks. Gain architectural patterns, tooling recommendations, and battle-tested lessons for building production-ready GenAI services, such as chatbots, AI assistants, and document question-answering systems, that scale reliably.