Overview
Explore a 31-minute conference talk from SREcon24 Europe/Middle East/Africa on applying Site Reliability Engineering (SRE) principles to High Performance Computing (HPC) systems. Learn how Los Alamos National Laboratory (LANL) addresses the challenges of managing purpose-built HPC machines that have traditionally been operated through human-facing workflows. Discover how adopting SRE methodologies in the new administrative stack, OpenCHAMI, helps maintain critical performance metrics while combating generational churn in HPC systems. Understand how this approach preserves exact reproducibility, parallel bandwidth, and optimal compute time to solution while better serving the needs of specialized code bases and their customers.
Syllabus
SREcon24 Europe/Middle East/Africa - Science Reliability Engineering for High Performance Computing
Taught by
USENIX