Resilient On-Premises AI Workloads on Kubernetes with Hyperconverged Infrastructure

Learn to deploy resilient AI workloads on Kubernetes using hyperconverged infrastructure (HCI) in this 14-minute conference talk. Discover how to build fault-tolerant systems by integrating compute, storage, and networking into a unified platform that eliminates single points of failure. Explore how deploying OpenShift clusters on HCI supports high availability and operational efficiency for complex AI workloads. Learn the design principles for building robust systems with redundant servers and networks, and understand how software-defined storage (SDS) provides scalability, resilience, and seamless data access. Examine critical business continuity strategies, including backup policies and disaster recovery (DR) plans and protections, that minimize downtime and safeguard against data loss. Compare the performance and reliability trade-offs between bare-metal and virtual machine deployments for AI workloads. Gain practical insights into streamlining day-two operations through automated monitoring and alerting, firmware upgrades, auto-scaling, and proactive issue resolution, all of which improve overall system reliability and performance.