Overview
Explore how to scale AI workloads efficiently through model caching in hybrid cloud environments in this 31-minute conference talk from the Linux Foundation. Learn about the challenges of rapidly scaling inference services during peak hours and of keeping GPUs well utilized for fine-tuning workloads as AI models continue to grow in size and complexity.

Discover how Bloomberg's Data Science Platform team implemented a "Model Cache" feature in the open source KServe project, designed to cache large models on GPUs across multi-cloud, multi-cluster cloud-native environments. Understand how model caching reduces load times during service auto-scaling, improves resource utilization, and boosts data scientists' productivity. Gain insights into Bloomberg's integration of KServe's Model Cache into its AI workloads, as well as the API it built on top of Karmada to manage cache federation. Finally, learn about the impact of enabling model caching and take away practical strategies for adopting the feature in your own AI infrastructure, making this talk essential viewing for engineers running large-scale machine learning deployments.
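To make the core idea concrete, below is a minimal Python sketch of node-local model caching as described in the abstract: replicas created during auto-scaling reuse a locally cached copy of the model instead of re-downloading it from remote storage. The names here (`ModelCache`, `download_from_remote`, the cache directory) are illustrative assumptions for this sketch, not KServe's actual API.

```python
import shutil
import tempfile
import threading
from pathlib import Path

# Hypothetical sketch of the idea behind model caching: replicas spawned
# during auto-scaling reuse a node-local copy of the model weights instead
# of re-downloading them from remote object storage on every scale-up.
# These names are illustrative; this is not KServe's real interface.

class ModelCache:
    def __init__(self, cache_dir: str = "/var/cache/models"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self._lock = threading.Lock()

    def _local_path(self, model_uri: str) -> Path:
        # Derive a stable on-disk location from the model URI.
        safe_name = model_uri.replace("://", "_").replace("/", "_")
        return self.cache_dir / safe_name

    def get(self, model_uri: str) -> Path:
        """Return a local path for the model, downloading only on a miss."""
        path = self._local_path(model_uri)
        with self._lock:
            if path.exists():
                return path  # cache hit: the new replica skips the slow download
            tmp = Path(tempfile.mkdtemp(dir=self.cache_dir))
            download_from_remote(model_uri, tmp)  # slow path, runs once per node
            shutil.move(str(tmp), str(path))      # publish the finished copy
            return path

def download_from_remote(model_uri: str, dest: Path) -> None:
    # Placeholder for pulling weights from object storage (e.g. S3/GCS).
    (dest / "weights.bin").write_bytes(b"\x00" * 1024)

if __name__ == "__main__":
    cache = ModelCache(cache_dir=tempfile.mkdtemp())
    first = cache.get("s3://models/llm-70b")   # miss: downloads once
    second = cache.get("s3://models/llm-70b")  # hit: returns immediately
    assert first == second
```

The design point the talk highlights is the second call: once a model is cached where the GPU is, every subsequent replica on that node starts serving without paying the multi-gigabyte download cost, which is what shortens auto-scaling load times.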
Syllabus
Gotta Cache 'em All: Scaling AI Workloads With Model Caching in a Hybrid... - Rituraj Singh & Jin Dong
Taught by
Linux Foundation