Beyond Uptime - Revolutionizing Fintech Reliability

Conf42 via YouTube

Explore Site Reliability Engineering strategies tailored to fintech environments in this 41-minute conference talk from Conf42 SRE 2025. Discover how to move beyond traditional uptime metrics toward reliability practices that address the specific challenges of financial technology infrastructure. Learn to optimize GPU utilization for AI workloads, apply advanced auto-scaling strategies, and orchestrate multi-GPU training efficiently. Cover monitoring and observability techniques, storage optimization for machine learning pipelines, cost-efficient cloud spending with spot instances, and node pool configuration. Gain insights into fair resource sharing among teams and into measuring the impact of these optimization strategies. Finish with a step-by-step implementation guide that addresses the challenges of running AI infrastructure on Kubernetes.

Syllabus

00:00 Introduction to FinTech Reliability and SRE Innovation
00:08 Challenges in AI Infrastructure and Kubernetes
01:43 Optimizing GPU Utilization for AI Workloads
03:25 Advanced Auto-Scaling Strategies
06:01 Efficient Multi-GPU Training Orchestration
08:25 Comprehensive Monitoring and Observability
11:16 Optimizing Storage for ML Pipelines
13:39 Cost-Efficient Cloud Spend with Spot Instances
14:53 Optimizing Node Pool Configurations
16:42 Fair Resource Sharing Among Teams
17:52 Impact of Optimization Strategies
18:55 Step-by-Step Guide to Implementing Strategies
20:36 Conclusion and Final Thoughts