Optimizing Multi-Agent LLM Workloads With AMD GPUs and Kueue
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn advanced optimization strategies for multi-agent Large Language Model workloads using AMD GPUs and Kueue in this 30-minute conference talk from the Cloud Native Computing Foundation. Explore the distinct compute-memory phase transitions in LLM inference workloads, where prompt ingestion involves compute-bound attention calculations while token generation becomes memory-bound due to repeated parameter loading from DRAM and HBM.

Discover how multi-agent systems integrate heterogeneous components with disparate resource demands that must operate synchronously, and understand how AMD GPUs and Kueue optimize compute and memory partitioning, binpacking, and colocation of tightly coupled agentic workflows alongside inference tasks with bursty resource patterns. Master strategies for designing advanced scheduling and binpacking for agent interaction workflows to achieve 50-70% higher throughput compared to traditional approaches.

Examine how high-capacity, high-bandwidth GPUs such as the AMD MI355x are optimized for mixed-workload AI applications, and learn to leverage unified memory access to minimize cross-component latency while preserving isolation. Gain practical insights from industry experts Yuchen Fama from Cognality, Jodie Su from AMD, and Zhiming Shen from Exostellar on implementing these optimization techniques in cloud-native environments.
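As one concrete illustration (a sketch, not taken from the talk itself), queueing agent and inference jobs against a shared pool of AMD GPUs with Kueue might use a `ResourceFlavor` and `ClusterQueue` like the following; the flavor name, node label, and quota values are hypothetical, and the `amd.com/gpu` resource name assumes the AMD GPU device plugin is installed on the cluster.

```yaml
# Hypothetical flavor representing nodes with AMD MI355x-class GPUs.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: amd-mi355x
spec:
  nodeLabels:
    gpu.example.com/family: mi355x   # assumed node label
---
# A shared queue that both agentic workflows and inference jobs submit to,
# letting Kueue binpack bursty agent tasks alongside inference workloads.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: multi-agent-llm
spec:
  namespaceSelector: {}   # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "amd.com/gpu"]
    flavors:
    - name: amd-mi355x
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 512Gi
      - name: "amd.com/gpu"
        nominalQuota: 8
```

Jobs then reference this queue through a namespaced `LocalQueue` and are admitted only when the GPU quota can accommodate them, which is the mechanism the talk builds on for colocating tightly coupled agent workflows with inference tasks.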
Syllabus
Optimizing Multi-Agent LLM Workloads With AMD GPUs and Kueue - Yuchen Fama, Jodie Su & Zhiming Shen
Taught by
CNCF [Cloud Native Computing Foundation]