Overview
Discover how to dramatically improve AI inference performance in this 37-minute conference talk from the Linux Foundation's Open Source Summit. Learn practical techniques for reducing AI model inference latency from several seconds to under 100 milliseconds while achieving 10x throughput improvements using serverless architecture on Kubernetes. Explore real-world implementations with Knative and Kubeless for deploying containerized machine learning models as serverless functions, complete with Infrastructure as Code (IaC) demonstrations for provisioning the entire infrastructure. Master event-driven automation for data preprocessing pipelines that can reduce preparation time by 30%, and understand how to distribute model training jobs across ephemeral pods to cut training times by 40% on large datasets. Gain insights into dynamically scaling inference endpoints in response to real-time traffic spikes, moving beyond theoretical serverless benefits to measurable performance gains in production AI applications.
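As a rough illustration of the deployment pattern the talk covers (containerized models served as autoscaling serverless functions on Kubernetes), the sketch below creates a Knative Service with concurrency-based autoscaling via the official Kubernetes Python client. The image name, namespace, function names, and scaling targets are illustrative assumptions, not values taken from the talk.

```python
# Minimal sketch: deploy a containerized model server as a Knative Service that
# scales with live traffic. Assumes a cluster with Knative Serving installed and
# the `kubernetes` Python package; all names and numbers are hypothetical.
from kubernetes import client, config

def deploy_inference_service(name="image-classifier",
                             namespace="ml-serving",
                             image="registry.example.com/models/classifier:latest"):
    """Create a Knative Service so the inference endpoint scales with request load."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster

    service = {
        "apiVersion": "serving.knative.dev/v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        # Scale on in-flight requests per pod; allow scale-to-zero when idle.
                        "autoscaling.knative.dev/metric": "concurrency",
                        "autoscaling.knative.dev/target": "10",
                        "autoscaling.knative.dev/min-scale": "0",
                        "autoscaling.knative.dev/max-scale": "50",
                    }
                },
                "spec": {
                    "containers": [{
                        "image": image,
                        "ports": [{"containerPort": 8080}],
                        "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
                    }]
                },
            }
        },
    }

    # Knative Services are custom resources, so they are created through the
    # CustomObjectsApi rather than the core apps API.
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.knative.dev",
        version="v1",
        namespace=namespace,
        plural="services",
        body=service,
    )

if __name__ == "__main__":
    deploy_inference_service()
```

With a setup like this, Knative's autoscaler adds pods as concurrent requests rise during traffic spikes and scales back to zero when the endpoint is idle, which is the kind of dynamic scaling behavior the description refers to.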
Syllabus
From Hours To Milliseconds: Scaling AI Inference 10x With... Anmol Krishan Sachdeva & Paras Mamgain
Taught by
Linux Foundation