Explore a comprehensive conference talk that addresses critical challenges in maintaining high-throughput AI workloads within Kubernetes environments. Learn how to implement zero-downtime upgrades for FUSE (Filesystem in Userspace) systems that support demanding applications like autonomous driving and large-scale recommendation systems. Discover practical solutions for overcoming common issues such as file descriptor invalidation, cache loss, and write interruptions that typically occur during filesystem upgrades or restarts. Examine real-world implementation strategies for self-healing mounts and rolling client upgrades in FUSE-based distributed file systems, with deep integration into Kubernetes CSI and Operators. Understand why the default CSI lifecycle proves inadequate for FUSE-based systems and gain insights into redesigning client upgrade processes to maintain active I/O sessions without disruption. Benefit from lessons learned in large-scale production deployments, including analysis of key failure cases encountered in early versions and the evolution of solutions that ensure GPUs remain fully utilized during system maintenance operations.