Production-Ready LLMs on Kubernetes: Patterns, Pitfalls, and Performance
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This technical presentation explores the challenges and solutions involved in deploying open-source Large Language Models (LLMs) on Kubernetes infrastructure. Learn from Priya Samuel and Luke Marsden as they share their practical experience implementing production-grade LLM systems. Through live demonstrations, they walk the complete deployment lifecycle, from GPU configuration to advanced optimization techniques including Flash Attention, quantization tradeoffs, and GPU sharing. The talk covers architectural patterns using Ollama and vLLM, effective model-weight management, context-length optimization strategies, and production approaches to fine-tuning with Axolotl and multi-model serving with LoRAX. Walk away with a blueprint for building reliable, scalable LLM infrastructure on Kubernetes that avoids common pitfalls while maximizing performance.
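As a point of reference for the GPU-configuration step the talk covers, a Kubernetes workload serving an LLM typically requests a GPU through the standard extended-resource mechanism. The sketch below is a minimal, illustrative Deployment for a vLLM server; the image tag, model name, and port are assumptions, not the presenters' actual manifest, and it presumes the NVIDIA device plugin is installed on the cluster.

```yaml
# Minimal illustrative sketch -- not the presenters' exact configuration.
# Assumes: NVIDIA device plugin installed; image and model names are examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest        # example image
        args: ["--model", "mistralai/Mistral-7B-Instruct-v0.2"]  # example model
        ports:
        - containerPort: 8000                 # vLLM's OpenAI-compatible API
        resources:
          limits:
            nvidia.com/gpu: 1                 # schedules the pod onto a GPU node
```

The `nvidia.com/gpu` resource limit is what ties the pod to GPU-equipped nodes; techniques discussed in the talk such as GPU sharing change how that resource is advertised and divided.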
Syllabus
Production-Ready LLMs on Kubernetes: Patterns, Pitfalls, and Performance — Priya Samuel & Luke Marsden
Taught by
CNCF [Cloud Native Computing Foundation]