Honey, I Shrunk the Data Center! - Chaos Engineering and Data Center Resilience Testing

Learn how to run extreme chaos engineering experiments that shut down entire data centers in this conference talk from SREcon25 Europe/Middle East/Africa. Follow Allegro's five-year journey of deliberately pulling the plug on data centers to test service resilience and keep operations running through complete facility outages. Explore how the chaos engineering practice evolved, how physical server shutdowns and startups are coordinated, and how capacity is tested on critical system paths. Gain insights into building reliable dependency maps that accurately reflect how your infrastructure is interconnected. Understand strategies for securing executive buy-in for high-risk chaos experiments and for handling the challenges that emerge in hybrid cloud environments. Examine real-world pitfalls encountered during these experiments alongside the victories that improved system reliability. The talk offers practical knowledge for site reliability engineers and technical leaders, with actionable insights for introducing similar chaos engineering practices in your own organization.