From Reliable Models to Resilient ML Platforms

Conf42 via YouTube

Learn how to transition machine learning models from development environments to production-ready, resilient platforms in this 25-minute conference talk. Explore the fundamental challenges of production ML including model drift, scaling issues, latency requirements, and availability concerns. Discover the advantages of modern cloud-native platforms over legacy systems, with IBM Cloud/SoftLayer serving as a practical infrastructure example. Master the essential pillars of resilient ML infrastructure including high availability and disaster recovery strategies. Implement security-by-design principles incorporating zero trust architecture and protection against DDoS attacks and ransomware threats. Understand how to sustain ML workloads through proper rate limiting, traffic spike management, and DDoS readiness protocols. Examine critical operational aspects including environment segmentation, isolation techniques, and secure model serving practices. Align frameworks with operational controls covering identity and access management, audit logging, and container image scanning. Establish performance metrics and resiliency benchmarking using service level objectives and agreements. Navigate the people and process considerations for cross-functional ownership in production ML environments. Compare deployment patterns across cloud-native, hybrid, and multi-cloud architectures. Gain practical design principles and key takeaways for building robust ML platforms that can reliably serve models at scale.

Syllabus

Welcome & Speaker Introduction: Riva at Conf42 2026
Talk Overview: Moving ML from Lab to Production + Agenda
Why Production ML Is Hard: Drift, Scale, Latency & Availability
Modern Platforms vs Legacy: Cloud-Native Capabilities
IBM Cloud/SoftLayer as an Example Infrastructure Foundation
Pillars of Resilient ML Infrastructure: HA & Disaster Recovery
Security by Design: Zero Trust, DDoS/Ransomware Protection
Sustaining ML Workloads: Rate Limits, Traffic Spikes & DDoS Readiness
Segmentation, Environment Isolation & Secure Model Serving
Framework Alignment & Operational Controls: IAM, Audit Logs, Image Scanning
Performance Metrics & Resiliency Benchmarking: SLOs/SLAs
People & Process: Cross-Functional Ownership for Production ML
Deployment Patterns: Cloud-Native vs Hybrid vs Multi-Cloud
Design Principles & Key Takeaways + Closing/Q&A