Fit-to-Serve - How a New DRA Capability for Dynamic Device Sharing Fits Into Distributed LLM Serving
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore how Dynamic Resource Allocation (DRA) capabilities enhance distributed large language model serving through this 24-minute conference talk from CNCF. Learn about llm-d, a community-driven framework that modernizes LLM serving at scale within Kubernetes using a modular architecture that separates prefill and decode operations. Discover how the new DRA capability enables dynamic resource capacity requests and adjustments for compute and network devices, moving beyond traditional GPU units to more granular resource allocation including MIG slices. Understand how DRA's device selection based on fine-grained attributes and topology awareness eliminates the need for workarounds or rigid resource pools. See practical demonstrations of how these DRA enhancements make the llm-d framework more feasible and cost-effective, while examining remaining challenges and gaining insights for implementation in cloud-native environments.
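To make the attribute-based device selection concrete, here is a minimal sketch of what a DRA request for a MIG slice might look like using the Kubernetes `resource.k8s.io` API. The device class name and the attribute key/value (`profile`, `"1g.5gb"`) are illustrative assumptions, not taken from the talk; the exact names depend on the vendor's DRA driver.

```yaml
# Hypothetical sketch: a ResourceClaimTemplate selecting a specific MIG
# profile via a CEL expression over driver-published device attributes.
# deviceClassName and the attribute key are assumed, driver-specific names.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-slice-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # assumed device class
        selectors:
        - cel:
            # Match only devices advertising the desired MIG profile
            expression: device.attributes["gpu.example.com"].profile == "1g.5gb"
```

A pod then references the claim template in its `resourceClaims`, letting the scheduler pick a fitting slice based on the published attributes and topology rather than a whole-GPU resource count.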
Syllabus
Fit-to-Serve: How a New DRA Capability for Dynamic Device... Sunyanan Choochotkaew & Tatsuhiro Chiba
Taught by
CNCF [Cloud Native Computing Foundation]