Building Scalable ML Inferencing Pipelines Using Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore a conference talk that delves into building robust and scalable machine learning inference pipelines using Kubernetes. Learn how to construct performant inference services that scale on demand while maintaining low latency. Discover proven procedures and guidelines for managing inference pipelines on Kubernetes, including detailed insights into hardware requirements (GPU/CPU/memory) and essential K8s configurations for various inference engines. Master the implementation of fault-tolerant pipelines for Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) using fundamental Kubernetes constructs such as Operators, StatefulSets, and PersistentVolumes. Gain practical knowledge about setting up automated monitoring and effective strategies for troubleshooting and fixing hardware and software failures in production environments.
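As a taste of the constructs the talk covers, here is a minimal sketch using the official Kubernetes Python client to declare a StatefulSet for a GPU-backed LLM inference server, with a persistent volume claim template for model weights. The image name, resource sizes, namespace, and labels are illustrative assumptions, not values from the talk.

```python
from kubernetes import client, config

# Load local kubeconfig (use config.load_incluster_config() inside a cluster).
config.load_kube_config()

sts = client.V1StatefulSet(
    api_version="apps/v1",
    kind="StatefulSet",
    metadata=client.V1ObjectMeta(name="llm-inference"),
    spec=client.V1StatefulSetSpec(
        service_name="llm-inference",
        replicas=2,  # scale out/in on demand
        selector=client.V1LabelSelector(match_labels={"app": "llm-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "llm-inference"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="inference-engine",
                        # Hypothetical inference engine image; substitute your own.
                        image="nvcr.io/nvidia/tritonserver:24.05-py3",
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "4", "memory": "16Gi"},
                            # GPU request: schedules the pod onto a GPU node.
                            limits={"nvidia.com/gpu": "1", "memory": "32Gi"},
                        ),
                        volume_mounts=[
                            client.V1VolumeMount(
                                name="model-store", mount_path="/models"
                            )
                        ],
                    )
                ]
            ),
        ),
        # Each replica gets its own PersistentVolume for cached model weights.
        volume_claim_templates=[
            client.V1PersistentVolumeClaim(
                metadata=client.V1ObjectMeta(name="model-store"),
                spec=client.V1PersistentVolumeClaimSpec(
                    access_modes=["ReadWriteOnce"],
                    resources=client.V1ResourceRequirements(
                        requests={"storage": "100Gi"}
                    ),
                ),
            )
        ],
    ),
)

client.AppsV1Api().create_namespaced_stateful_set(namespace="default", body=sts)
```

A StatefulSet (rather than a Deployment) gives each replica a stable identity and its own volume claim, which suits engines that download and cache large model weights locally; an Operator can then watch the workload and recover failed replicas automatically.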
Syllabus
Scalable ML Inferencing Pipeline Using K8s - Smitha Jayaram & Vinod Eswaraprasad, NVIDIA
Taught by
CNCF [Cloud Native Computing Foundation]