Best Practices for Deploying LLM Inference, RAG and Fine-Tuning Pipelines on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to effectively deploy, scale, and manage Large Language Model (LLM) inference pipelines on Kubernetes in this technical conference talk from NVIDIA experts. Discover essential best practices for implementing common patterns including inference, retrieval-augmented generation (RAG), and fine-tuning workflows. Master techniques for reducing inference latency through model caching, optimizing GPU resource utilization with efficient scheduling strategies, handling multi-GPU/node configurations, and implementing auto-quantization. Explore methods for enhancing security through Role-Based Access Control (RBAC), setting up comprehensive monitoring, configuring auto-scaling, and supporting air-gapped cluster deployments. Follow demonstrations of building flexible pipelines using both a lightweight standalone operator-pattern tool and KServe, an open-source AI inference platform. Gain practical knowledge for post-deployment management to improve the performance, efficiency, and security of LLM deployments in Kubernetes environments.
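As a concrete illustration of the KServe operator pattern the talk demonstrates, the sketch below shows a minimal `InferenceService` manifest combining several of the practices mentioned above: serving a model from a pre-populated cache to reduce cold-start latency, requesting GPU resources explicitly, and setting replica bounds for autoscaling. The model name, storage URI, and resource figures are illustrative assumptions, not values from the talk.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-demo            # hypothetical deployment name
spec:
  predictor:
    minReplicas: 1          # keep one warm replica to avoid cold starts
    maxReplicas: 4          # upper bound for KServe autoscaling
    model:
      modelFormat:
        name: huggingface
      # Pre-cached model weights on a PVC so pods skip re-downloading
      # the model at startup (path is an assumption for this sketch).
      storageUri: "pvc://model-cache/llm-weights"
      resources:
        limits:
          nvidia.com/gpu: "1"   # request a dedicated GPU for inference
```

Applying this with `kubectl apply -f` would have KServe create and manage the serving deployment; the same manifest shape extends to multi-GPU setups by raising the `nvidia.com/gpu` limit.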
Syllabus
Best Practices for Deploying LLM Inference, RAG and Fine-Tuning Pipelines on Kubernetes, presented by Meenakshi Kaushik & Shiva Krishna Merla
Taught by
CNCF [Cloud Native Computing Foundation]