Syllabus
00:00 Introduction to FinTech Reliability and SRE Innovation
00:08 Challenges in AI Infrastructure and Kubernetes
01:43 Optimizing GPU Utilization for AI Workloads
03:25 Advanced Auto-Scaling Strategies
06:01 Efficient Multi-GPU Training Orchestration
08:25 Comprehensive Monitoring and Observability
11:16 Optimizing Storage for ML Pipelines
13:39 Cost-Efficient Cloud Spend with Spot Instances
14:53 Optimizing Node Pool Configurations
16:42 Fair Resource Sharing Among Teams
17:52 Impact of Optimization Strategies
18:55 Step-by-Step Guide to Implementing Strategies
20:36 Conclusion and Final Thoughts
Taught by
Conf42