From Outage to Observability - Lessons From a Kubernetes Meltdown

Learn from a real-world Kubernetes disaster in this keynote conference talk, which examines how a major outage exposed critical gaps in observability infrastructure. Discover why basic monitoring and logging with Prometheus and the ELK stack proved insufficient when a DevOps automation platform suffered a catastrophic failure that brought down an entire multi-tenant cluster. Explore the root causes: poor log correlation, inadequate tracing, a single customer's CI/CD pipeline that overwhelmed shared resources, aggressive autoscaling that overloaded the control plane, and a lack of proper tenant isolation. Understand the recovery strategy, which involved deploying distributed Prometheus monitoring, isolating workloads per tenant with dedicated monitoring, adopting fine-grained autoscaling policies, and experimenting with profiling and tracing tools like Parca and Odigos. Gain practical insights into building resilient Kubernetes infrastructure that can handle scale while maintaining observability, and learn how to prevent similar outages in your own cloud native environments.