Load-Aware GPU Fractioning for LLM Inference on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to optimize GPU resource allocation for Large Language Model (LLM) inference on Kubernetes in this technical conference talk from IBM researchers. Explore why efficient GPU utilization is hard to achieve and follow an analytical approach to relating request load to resource requirements. Examine how the GPU compute and memory requirements of LLM inference servers such as vLLM correlate with configuration parameters and key performance metrics. See how to choose a GPU fraction at deployment time based on model characteristics and estimated workload. Watch a demonstration of an open-source controller that automatically converts whole-GPU requests into fractional requests backed by MIG (Multi-Instance GPU) slices, improving resource density and sustainability while maintaining service level objectives.
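As a rough illustration of the load-to-fraction idea described above, the sketch below estimates the GPU memory an inference server might need (model weights plus KV cache for the expected number of in-flight tokens) and picks the smallest MIG slice that fits. It is not the speakers' method or their controller's logic: the helper names, the fixed overhead_gb allowance, and the memory-only sizing rule (compute is ignored) are all assumptions made for this example. The profile names are the nominal A100 40GB MIG profiles.

```python
from dataclasses import dataclass
from typing import Optional

# Nominal MIG profiles on an A100 40GB: (profile name, memory in GB).
# Sizing here uses memory only; compute slices are ignored (an assumption).
MIG_PROFILES = [
    ("1g.5gb", 5),
    ("2g.10gb", 10),
    ("3g.20gb", 20),
    ("7g.40gb", 40),
]


@dataclass
class ModelSpec:
    """Hypothetical description of a decoder-only transformer."""
    params_billion: float
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16 / bf16


def kv_cache_bytes_per_token(m: ModelSpec) -> int:
    # One key and one value vector per layer and per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_value


def estimate_memory_gb(m: ModelSpec, concurrent_requests: int,
                       avg_tokens_per_request: int,
                       overhead_gb: float = 1.0) -> float:
    """Weights + KV cache for in-flight tokens + a fixed overhead guess."""
    weights = m.params_billion * 1e9 * m.bytes_per_value
    tokens_in_flight = concurrent_requests * avg_tokens_per_request
    kv_cache = kv_cache_bytes_per_token(m) * tokens_in_flight
    return (weights + kv_cache) / 1e9 + overhead_gb


def pick_mig_profile(required_gb: float) -> Optional[str]:
    """Return the smallest profile whose memory covers the estimate."""
    for name, mem_gb in MIG_PROFILES:
        if mem_gb >= required_gb:
            return name
    return None  # does not fit in a single A100 40GB


if __name__ == "__main__":
    small_model = ModelSpec(params_billion=1.0, num_layers=16,
                            num_kv_heads=8, head_dim=64)
    for load in (8, 64):
        need = estimate_memory_gb(small_model, load, 1024 if load == 8 else 2048)
        print(load, round(need, 2), pick_mig_profile(need))
```

In this toy model a light load fits in the smallest slice while a heavier one needs the next size up, which is the kind of load-dependent choice the talk sets out to automate. A real sizing decision would also have to account for compute throughput and latency targets.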
Syllabus
Load-Aware GPU Fractioning for LLM Inference on Kubernetes - Olivier Tardieu & Yue Zhu, IBM
Taught by
CNCF [Cloud Native Computing Foundation]