Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on vLLM
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to adapt vLLM, the popular open-source inference engine, to support domestic Chinese GPUs and deploy it on Kubernetes for large model inference in this 11-minute keynote presentation. Discover how acceleration features such as PagedAttention, Continuous Batching, and Chunked Prefill can be enabled on heterogeneous Chinese chips, addressing the growing demand for domestic GPU adoption in inference workloads. Explore performance bottleneck analysis techniques and chip operator development strategies that maximize hardware potential, since Chinese inference engines are still maturing in functionality, performance, and ecosystem support. Finally, understand how to deploy the adapted vLLM engine on Kubernetes with minimal code using the open-source llmaz project, and examine llmaz's approach to heterogeneous GPU scheduling, monitoring, and elastic scaling for inference services in cloud-native environments.
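For readers unfamiliar with the feature names above, the following is a toy Python sketch of the core idea behind PagedAttention: the KV cache is split into fixed-size blocks, and each sequence keeps a block table of indices, so memory is allocated on demand rather than reserved contiguously per request. The class and block size here are illustrative, not vLLM's actual implementation.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class PagedKVCache:
    """Toy block allocator; real PagedAttention also stores K/V tensors."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> block indices

    def ensure_capacity(self, seq_id: str, total_tokens: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-total_tokens // BLOCK_SIZE)  # ceiling division
        while len(table) < needed:
            table.append(self.free_blocks.pop())  # allocate a block on demand

    def release(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the shared pool, which is
        # the property Continuous Batching exploits to admit new requests.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
cache.ensure_capacity("req-1", total_tokens=20)  # 20 tokens -> 2 blocks
print(cache.block_tables["req-1"])               # e.g. [63, 62]
cache.release("req-1")
```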
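The talk positions llmaz as the deployment layer for the adapted engine. As a rough sketch of what driving such a deployment programmatically could look like, the snippet below uses the official Kubernetes Python client to create an llmaz-style custom resource; the CRD group, version, kind, and spec fields shown are assumptions made for illustration, so consult the llmaz project (github.com/InftyAI/llmaz) for the actual API.

```python
# Sketch only: assumes the llmaz CRDs are installed in the cluster and that a
# "Playground" resource under the group inference.llmaz.io exists; the spec
# field names are hypothetical placeholders, not a documented schema.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

playground = {
    "apiVersion": "inference.llmaz.io/v1alpha1",  # assumed group/version
    "kind": "Playground",
    "metadata": {"name": "qwen-demo", "namespace": "default"},
    "spec": {
        "replicas": 1,
        "modelClaim": {"modelName": "qwen2-7b-instruct"},  # hypothetical model name
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="inference.llmaz.io",
    version="v1alpha1",
    namespace="default",
    plural="playgrounds",
    body=playground,
)
```

In practice the same resource would typically be applied as a YAML manifest with kubectl; either way, the controller reconciles it into a running vLLM-backed inference service.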
Syllabus
Keynote: Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on vLLM - Kante Yin
Taught by
CNCF [Cloud Native Computing Foundation]