Help! My LLM Is a Resource Hog - How We Tamed Inference With Kubernetes and Open Source Muscle
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to optimize large language model (LLM) inference performance and resource management using Kubernetes and open-source CNCF tools in this 26-minute conference talk. Through a real-world case study presented by speakers from Forrester Research and vCluster, discover practical solutions to common LLM deployment challenges: slow inference, unpredictable GPU usage, and escalating costs. Learn to serve LLMs reliably with KServe and Kubeflow; benchmark and auto-scale workloads with Volcano and KEDA to improve resource utilization and reduce latency; and monitor model performance and detect drift with Prometheus, Grafana, and OpenTelemetry. The talk shares field-tested architectures, performance benchmarks, and lessons learned from building production-ready, efficient, and scalable LLM inference systems entirely on open-source tooling you can adopt immediately.
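The serving-plus-autoscaling pattern the talk covers can be sketched as two Kubernetes manifests. This is a minimal illustration, not code from the talk: a KServe InferenceService hosting a model, and a KEDA ScaledObject scaling it on a Prometheus query. The model URI, metric name, and threshold are hypothetical placeholders.

```yaml
# KServe InferenceService serving an LLM (model format and storage URI are placeholders)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-demo
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "hf://example-org/example-llm"  # hypothetical model location
      resources:
        limits:
          nvidia.com/gpu: "1"
---
# KEDA ScaledObject scaling the predictor workload on a Prometheus latency query
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-demo-scaler
spec:
  scaleTargetRef:
    name: llm-demo-predictor  # Deployment KServe creates in raw-deployment mode
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: avg(request_latency_seconds)  # hypothetical metric exposed by the server
        threshold: "0.5"
```

In this sketch, KEDA polls Prometheus and adds replicas whenever average request latency exceeds 0.5 seconds, which is one way to trade GPU cost against tail latency as the talk describes.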
Syllabus
Help! My LLM Is a Resource Hog: How We Tamed Inference With Kubernetes... Aditya Soni & Hrittik Roy
Taught by
CNCF [Cloud Native Computing Foundation]