
CNCF [Cloud Native Computing Foundation]

Alertmanager Has Amnesia - Should We Fix It?

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Explore Alertmanager's state-loss problem and potential solutions in this conference talk from the Cloud Native Computing Foundation. Learn about a fundamental design limitation: when Alertmanager restarts, it loses track of which alerts are actively firing, leaving gaps in alert state awareness even though Prometheus rules keep sending alerts. Discover how network unreliability, DNS hiccups, and intermittent failures compound the problem in high-volume production environments, leaving Alertmanager with a significantly incomplete picture of what is firing. Examine the added complexity of clustered Alertmanager instances run for high availability, including race conditions and edge cases that produce duplicate notifications or missing resolved alerts. Follow the speaker's team through a hackathon project on maintaining a shared external view of state that persists across restarts, revisiting a 2019 proposal to use etcd as a potential backend. Gain insight into the motivations behind this approach, the implementation details of the prototype, and the progress made since the initial hackathon, and weigh whether the growing adoption of the Prometheus ecosystem makes it time to reconsider this architectural change.
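To make the restart problem concrete, here is a minimal sketch of the idea described above. It is not Alertmanager's code and not the speaker's prototype: the class names, the `receive` method, and the `simulate_restart` helper are all hypothetical, and a local JSON file stands in for a shared external backend such as etcd. It only contrasts a store that keeps firing alerts in process memory with one that reloads them from external state after a restart.

```python
import json
import os
import tempfile


class InMemoryAlertStore:
    """Hypothetical store: firing alerts live only in process memory."""

    def __init__(self):
        self.firing = {}

    def receive(self, fingerprint, labels):
        # Record an alert as firing, keyed by an illustrative fingerprint.
        self.firing[fingerprint] = labels


class PersistentAlertStore(InMemoryAlertStore):
    """Hypothetical store: mirrors every change to an external backend.

    Assumption: a JSON file stands in for a shared backend such as etcd.
    """

    def __init__(self, path):
        super().__init__()
        self.path = path
        # On startup, reload whatever state survived the previous process.
        if os.path.exists(path):
            with open(path) as f:
                self.firing = json.load(f)

    def receive(self, fingerprint, labels):
        super().receive(fingerprint, labels)
        # Write the full state out after each change (simple, not efficient).
        with open(self.path, "w") as f:
            json.dump(self.firing, f)


def simulate_restart(make_store):
    """Receive one alert, then build a fresh store as a restart would."""
    store = make_store()
    store.receive("a1", {"alertname": "DiskFull", "instance": "db-1"})
    restarted = make_store()  # the old object is discarded, as on restart
    return restarted.firing


if __name__ == "__main__":
    print("in-memory after restart:", simulate_restart(InMemoryAlertStore))
    with tempfile.TemporaryDirectory() as tmp:
        state_path = os.path.join(tmp, "alerts.json")
        print(
            "persistent after restart:",
            simulate_restart(lambda: PersistentAlertStore(state_path)),
        )
```

The in-memory store comes back empty after the simulated restart and has to wait for alerts to be re-sent, while the persistent store still knows about the firing alert. A real shared backend would also have to handle concurrent writers across clustered instances, which this sketch deliberately ignores.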

Syllabus

Alertmanager Has Amnesia – Should We Fix It? - Joel Verezhak

Taught by

CNCF [Cloud Native Computing Foundation]
