Alertmanager Has Amnesia - Should We Fix It?

Explore Alertmanager's state-loss problem and possible fixes in this conference talk from the Cloud Native Computing Foundation. Learn about the fundamental design limitation: when Alertmanager restarts, it loses its view of which alerts are actively firing, creating gaps in its awareness of alert state even though Prometheus rules keep sending alerts. Discover how network unreliability, DNS hiccups, and other intermittent failures compound the problem in high-volume production environments, leaving significant gaps between the alerts that are actually firing and what Alertmanager believes is firing. Examine the added complexity of clustered Alertmanager instances run for high availability, including race conditions and edge cases that result in duplicate notifications or missing resolved alerts. Follow the speaker's team through a hackathon project that maintains a shared external view of alert state that persists across restarts, revisiting the 2019 proposal to use etcd as a potential backend. Gain insight into the motivations behind this approach, the implementation details of the prototype, and the progress made since the initial hackathon, and consider whether the growing adoption of the Prometheus ecosystem makes it time to reconsider this architectural change.