Learn about the redesigned, event-based retry logic for the orchagent program in SONiC's swss module in this 22-minute conference talk. See how the original brute-force retry strategy became a workflow bottleneck once failed tasks accumulated at scale, and how Alibaba engineers redesigned it. The redesign rests on four insights: identifying which failures warrant an immediate retry, treating each failure reason as a constraint, notifying the retry logic in real time when a constraint is resolved, and preventing retries from starving the normal event workflow. The talk also covers the technical challenges of managing failed tasks in a network operating system and the approach taken to optimize retry mechanisms for better system performance and reliability.
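
The four insights can be illustrated with a small sketch. This is not orchagent's actual code (orchagent is written in C++, and the talk's implementation is not reproduced here). The class `RetryCache`, the function `run_iteration`, and the `retry_budget` parameter are all hypothetical names chosen for illustration. The sketch assumes a handler that returns `None` on success or the name of the unmet constraint on failure.

```python
from collections import defaultdict, deque


class RetryCache:
    """Parks failed tasks under the constraint that blocked them (hypothetical design)."""

    def __init__(self, retry_budget=2):
        self.blocked = defaultdict(deque)  # constraint -> tasks waiting on it
        self.ready = deque()               # tasks whose constraint has been resolved
        self.retry_budget = retry_budget   # max retries handed out per loop iteration

    def park(self, task, constraint):
        """Record a failed task together with the reason it failed."""
        self.blocked[constraint].append(task)

    def notify_resolved(self, constraint):
        """Called when a constraint is satisfied; wakes only the tasks waiting on it."""
        self.ready.extend(self.blocked.pop(constraint, ()))

    def take_ready(self):
        """Return at most retry_budget ready tasks so retries cannot starve new events."""
        batch = []
        while self.ready and len(batch) < self.retry_budget:
            batch.append(self.ready.popleft())
        return batch


def run_iteration(cache, new_events, handler):
    """One loop iteration: new events first, then a bounded number of retries."""
    for task in list(new_events) + cache.take_ready():
        missing = handler(task)  # None on success, else the unmet constraint
        if missing is not None:
            cache.park(task, missing)
```

In contrast to a brute-force loop that re-attempts every failed task on every pass, a parked task here is only retried after `notify_resolved` is called for the constraint it is waiting on, and `retry_budget` caps how much of each iteration is spent on retries.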