Unified AIOps for Remote Management of Heterogeneous Open Source AI Systems
Open Compute Project via YouTube
Overview
Learn how to implement unified AIOps for managing diverse open-source AI infrastructure in this 18-minute conference talk. Discover why modern AI systems require intelligent coordination that extends beyond hardware to open-source firmware, multiple platforms, and fragmented telemetry data. Explore techniques for normalizing multi-vendor telemetry through AIOps pipelines so that predictive analytics can identify GPU degradation, thermal hotspots, and system inefficiencies. Master automated remediation strategies, including firmware patching, workload migration, and adaptive cooling. Understand how AI chatbots can simplify operations through natural language interfaces, making complex infrastructure management more accessible. Examine closed-loop optimization approaches that connect workload behavior with infrastructure conditions to improve performance per watt and reduce carbon footprint. Gain insight into scalable, intelligent AI infrastructure management models that align with the Open Compute Project values of openness, modularity, and sustainability for heterogeneous computing environments.
Syllabus
Unified AIOps for Remote Management of Heterogeneous Open Source AI Systems
Taught by
Open Compute Project