This conference talk from SREcon25 Americas explores the challenges and solutions for safely evaluating and deploying AI models in production environments. Brendan Burns from Microsoft shares practical insights from the development of Azure Copilot, focusing on the unique reliability challenges posed by AI systems. Learn how to implement effective evaluation frameworks for new models and prompts where performance isn't simply "working" or "broken" but requires probabilistic assessment across numerous user interactions. Discover methodologies for determining when model changes represent improvements versus regressions that require fixes or rollbacks. The presentation provides hands-on approaches currently used in production systems to maintain reliability when AI models form core components of user experiences.
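As one illustration of what "probabilistic assessment" of a model change could look like, here is a minimal sketch. It is an assumption for illustration, not the method presented in the talk or used by Azure Copilot. It supposes each user interaction can be graded pass or fail, then applies a standard two-proportion z-test to the pass rates of a baseline model and a candidate model. The function name `compare_pass_rates`, its parameters, and the default threshold of 1.96 (roughly a 5% two-sided significance level) are all hypothetical choices.

```python
import math


def compare_pass_rates(base_pass, base_total, cand_pass, cand_total,
                       z_threshold=1.96):
    """Classify a candidate model against a baseline using pass/fail counts.

    Hypothetical helper: runs a two-sided two-proportion z-test and returns
    a (verdict, z) tuple, where verdict is "improvement", "regression",
    or "inconclusive".
    """
    if base_total <= 0 or cand_total <= 0:
        raise ValueError("each sample needs at least one graded interaction")

    p_base = base_pass / base_total
    p_cand = cand_pass / cand_total

    # Pooled pass rate under the null hypothesis that both models are equal.
    pooled = (base_pass + cand_pass) / (base_total + cand_total)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / base_total + 1 / cand_total)
    )

    # Every interaction passed or every one failed: no variance to test on.
    if std_err == 0:
        return "inconclusive", 0.0

    z = (p_cand - p_base) / std_err
    if z >= z_threshold:
        return "improvement", z
    if z <= -z_threshold:
        return "regression", z
    return "inconclusive", z


if __name__ == "__main__":
    # Made-up counts: 800/1000 passes for the baseline, 850/1000 for the candidate.
    verdict, z = compare_pass_rates(800, 1000, 850, 1000)
    print(verdict, round(z, 2))
```

The "inconclusive" verdict is the point of the sketch: a small change in pass rate on a limited sample is not treated as either a win or a reason to roll back until more interactions have been graded.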