Overview
Discover how to architect and implement high-performance machine learning model serving systems capable of handling thousands of predictions per second in this 27-minute conference talk from Databricks. Learn the essential techniques for building inference pipelines that scale efficiently to massive request volumes while maintaining low latency requirements. Explore how to leverage Databricks Feature Store for consistent, low-latency feature lookups and implement auto-scaling strategies that balance performance optimization with cost management. Master the QPS × model execution time formula for determining optimal compute capacity and understand how to configure Feature Store for high-throughput operations. Gain insights into managing cold starts and scaling strategies specifically designed for latency-sensitive applications, while implementing comprehensive monitoring systems that provide visibility into inference performance. Apply these practical strategies to enterprise-grade ML serving systems, whether you're deploying recommender systems, real-time fraud detection models, or other high-volume prediction services.
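The capacity rule mentioned above (QPS × model execution time) can be sketched as a quick back-of-the-envelope calculation. This is an illustrative helper, not a Databricks API; the `concurrency_per_instance` and `headroom` parameters are assumptions added for the example.

```python
import math

def required_instances(qps: float, model_latency_s: float,
                       concurrency_per_instance: int = 1,
                       headroom: float = 1.2) -> int:
    """Estimate serving capacity from the QPS x execution-time rule.

    Required concurrent requests = QPS * per-request execution time.
    `concurrency_per_instance` (requests one replica can serve in
    parallel) and `headroom` (safety margin for traffic spikes) are
    illustrative assumptions, not platform-specific parameters.
    """
    concurrent_requests = qps * model_latency_s
    return math.ceil(concurrent_requests * headroom / concurrency_per_instance)

# 2,000 predictions/s at 50 ms each needs ~100 concurrent slots;
# with 4 slots per replica and 20% headroom that is 30 replicas.
print(required_instances(2000, 0.05, concurrency_per_instance=4))  # -> 30
```

Re-running the calculation with observed latency percentiles (e.g. p99 instead of mean) gives a more conservative capacity target for latency-sensitive workloads.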
Syllabus
High-Throughput ML: Mastering Efficient Model Serving at Enterprise Scale
Taught by
Databricks