Overview
Explore a conference talk on the challenges of maintaining high-throughput AI workloads in Kubernetes environments. Learn how to implement zero-downtime upgrades for FUSE (Filesystem in Userspace) systems that support demanding applications such as autonomous driving and large-scale recommendation systems. Discover practical solutions to problems that typically occur when a filesystem is upgraded or restarted: file descriptor invalidation, cache loss, and write interruptions. Examine real-world implementation strategies for self-healing mounts and rolling client upgrades in FUSE-based distributed file systems, with deep integration into Kubernetes CSI and Operators. Understand why the default CSI lifecycle is inadequate for FUSE-based systems, and how client upgrade processes can be redesigned to keep active I/O sessions running without disruption. Benefit from lessons learned in large-scale production deployments, including an analysis of key failure cases encountered in early versions and the evolution of solutions that keep GPUs fully utilized during system maintenance.
Syllabus
Enabling Seamless AI Workloads: Achieving Zero-Downtime Upgrades for FUSE in Kubernetes - Weiwei Zhu
Taught by
Linux Foundation