Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on vLLM
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to adapt vLLM, the popular open-source inference engine, to support domestic Chinese GPUs and deploy it on Kubernetes for large model inference in this 11-minute keynote presentation. Discover how acceleration features such as PagedAttention, Continuous Batching, and Chunked Prefill can be enabled on heterogeneous Chinese chips, addressing the growing demand for domestic GPU adoption in inference workloads. Explore performance bottleneck analysis techniques and chip operator development strategies that maximize hardware potential, since Chinese inference engines are still maturing in functionality, performance, and ecosystem support. Finally, understand how to deploy the adapted vLLM engine on Kubernetes with minimal code using the open-source llmaz project, and examine llmaz's approach to heterogeneous GPU scheduling, monitoring, and elastic scaling for inference services in cloud-native environments.
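For readers unfamiliar with the feature names above, the following is a toy Python sketch of the core idea behind PagedAttention: the KV cache is split into fixed-size blocks, and each sequence keeps a block table of indices, so memory is allocated on demand rather than reserved contiguously per request. The class and block size here are illustrative, not vLLM's actual implementation.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class PagedKVCache:
    """Toy block allocator; real PagedAttention also stores K/V tensors."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> block indices

    def ensure_capacity(self, seq_id: str, total_tokens: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-total_tokens // BLOCK_SIZE)  # ceiling division
        while len(table) < needed:
            table.append(self.free_blocks.pop())  # allocate a block on demand

    def release(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the shared pool, which is
        # the property Continuous Batching exploits to admit new requests.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
cache.ensure_capacity("req-1", total_tokens=20)  # 20 tokens -> 2 blocks
print(cache.block_tables["req-1"])               # e.g. [63, 62]
cache.release("req-1")
```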
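The talk positions llmaz as the deployment layer for the adapted engine. As a rough sketch of what driving such a deployment programmatically could look like, the snippet below uses the official Kubernetes Python client to create an llmaz-style custom resource; the CRD group, version, kind, and spec fields shown are assumptions made for illustration, so consult the llmaz project (github.com/InftyAI/llmaz) for the actual API.

```python
# Sketch only: assumes the llmaz CRDs are installed in the cluster and that a
# "Playground" resource under the group inference.llmaz.io exists; the spec
# field names are hypothetical placeholders, not a documented schema.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

playground = {
    "apiVersion": "inference.llmaz.io/v1alpha1",  # assumed group/version
    "kind": "Playground",
    "metadata": {"name": "qwen-demo", "namespace": "default"},
    "spec": {
        "replicas": 1,
        "modelClaim": {"modelName": "qwen2-7b-instruct"},  # hypothetical model name
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="inference.llmaz.io",
    version="v1alpha1",
    namespace="default",
    plural="playgrounds",
    body=playground,
)
```

In practice the same resource would typically be applied as a YAML manifest with kubectl; either way, the controller reconciles it into a running vLLM-backed inference service.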
Syllabus
Keynote: Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on vLLM - Kante Yin
Taught by
CNCF [Cloud Native Computing Foundation]