Meta's Kubernetes-based Portable AI Research Environment
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how Meta developed a Kubernetes-based portable AI research environment in this conference talk from KubeCon + CloudNativeCon. Discover how Meta collaborated with CoreWeave to implement SUNK (Slurm on Kubernetes), creating a unified computing platform that lets AI researchers work consistently across diverse multi-cloud infrastructure. Explore how this solution addresses the growing demands of AI research by providing a familiar Slurm interface while Kubernetes orchestration abstracts away the underlying infrastructure. Understand the architecture that delivers secure per-user isolation, shared storage mounts, streamlined access management, and comprehensive health checking across heterogeneous environments. Examine how the platform lets infrastructure engineers deploy consistent, portable solutions across multiple cloud providers while maintaining centralized observability and unified security controls. Gain insight into novel patterns that let users deploy infrastructure on Kubernetes without having to deal with the underlying complexity, and learn how OpenTelemetry serves as a unified interface for both platform-level and research-level monitoring.
Syllabus
Meta’s Kubernetes-based Portable AI Research Environment - Shaun Hopper, Meta & Navarre Pratt
Taught by
CNCF [Cloud Native Computing Foundation]