Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on vLLM
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to adapt vLLM, the popular open-source inference project, to support domestic Chinese GPUs and deploy it on Kubernetes for large model inference in this 11-minute keynote presentation. Discover the process of enabling acceleration features such as PagedAttention, Continuous Batching, and Chunked Prefill on heterogeneous Chinese chips, in response to the growing demand for domestic GPUs in inference workloads. Explore performance bottleneck analysis techniques and chip operator development strategies for getting the most out of the hardware when Chinese inference engines are still maturing in functionality, performance, and ecosystem. Understand how to deploy the adapted vLLM engine on Kubernetes with minimal code using the open-source llmaz project, and examine llmaz's approach to heterogeneous GPU scheduling, monitoring, and elastic scaling for inference services in cloud-native environments.
Syllabus
Keynote: Building a Large Model Inference Platform for Heterogeneous Chinese Chips Base... Kante Yin
Taught by
CNCF [Cloud Native Computing Foundation]