Overview
This webinar, led by Harsh Mishra, SRE at One2N Consulting, addresses the challenges Site Reliability Engineers and DevOps professionals face when deploying Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems on Kubernetes. Discover how LLMOps differs fundamentally from traditional MLOps, requiring specialized knowledge from day zero. Explore the prerequisites for hosting LLMs on Kubernetes, understand core components including RAG pipelines and vector databases, and learn through a practical case study using Nvidia GPUs and the Ray distributed computing framework. The 32-minute presentation covers One2N's journey toward implementing true LLMOps, compares infrastructure and architectural differences between MLOps and LLMOps, examines the transition from VMs to Kubernetes for AI workloads, and offers guidance on vector storage considerations. Gain practical takeaways from real-world implementations that will help future-proof your cloud strategy for the rapidly evolving field of GenAI and LLMs.
Syllabus
0:00 - Intro
0:24 - Why SREs need to worry about LLMs
1:11 - One2N's journey towards true LLMOps
15:30 - One2N's successes & challenges with K8s LLMOps
18:35 - MLOps vs LLMOps: Infra, Deployment & Architecture
19:31 - VMs to Kubernetes for LLMOps
23:07 - Vector Storage: index vs database
25:12 - Deep dive into One2N's Lab setup real-world use case
28:58 - Key takeaways to get started with LLMOps
31:23 - Outro
Taught by
Mirantis