Benchmarking GenAI Foundation Model Inference Optimizations on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about a Kubernetes SIG project designed to benchmark GenAI foundation model inference optimizations through this conference talk from KubeCon + CloudNativeCon. Discover how foundation models, which are general-purpose deep learning models trained on vast datasets and capable of handling diverse tasks, require optimization techniques to minimize recurring inference costs while maintaining accuracy. Explore various optimization methods including attention mechanism improvements like flash attention and paged attention, model parameter optimizations such as quantization, and serving optimizations including in-flight batching, speculative decoding, disaggregated serving, and smart routing strategies. Understand the critical need for consistent frameworks to measure and benchmark inference performance when testing and deploying optimization techniques. Gain insights into how this standardized benchmarking approach validates the performance and usability of inference optimizations for real-world applications in Kubernetes environments.
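Of the optimization methods the overview lists, quantization is the simplest to illustrate. Below is a minimal sketch of symmetric int8 quantization in plain Python — the helper names and example values are illustrative assumptions, not taken from the talk, and real serving stacks use per-channel or group-wise schemes on tensors rather than flat lists.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto the int8 range [-127, 127].

    Storing 1-byte integers instead of 4-byte floats cuts model memory
    roughly 4x, which is the main lever for lowering inference cost.
    """
    scale = max(abs(w) for w in weights) / 127.0  # largest magnitude maps to +/-127
    q = [round(w / scale) for w in weights]       # integer codes
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; rounding error is bounded by scale / 2."""
    return [v * scale for v in q]

# Illustrative weights (hypothetical values, not from any real model)
weights = [0.51, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The accuracy/size trade-off the overview alludes to shows up directly here: every weight is stored in one byte, and the worst-case reconstruction error stays within half a quantization step (`scale / 2`).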
Syllabus
Benchmarking GenAI Foundation Model Inference Optimizations on Kubernetes - S.M. Varghese & B. Slabe
Taught by
CNCF [Cloud Native Computing Foundation]