
Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on VLLM

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Learn how to adapt vLLM, the popular open-source inference engine, to support domestic Chinese GPUs and deploy it on Kubernetes for large model inference in this 11-minute keynote presentation. Discover how acceleration features such as PagedAttention, Continuous Batching, and Chunked Prefill can be enabled on heterogeneous Chinese chips, addressing the growing demand for domestic GPU adoption in inference workloads. Explore performance-bottleneck analysis techniques and chip operator development strategies for maximizing hardware potential on Chinese chips whose inference stacks are still maturing in functionality, performance, and ecosystem support. Finally, see how the adapted vLLM engine can be deployed on Kubernetes with minimal code using the open-source llmaz project, and examine llmaz's approach to heterogeneous GPU scheduling, monitoring, and elastic scaling for inference services in cloud-native environments.
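To make the Continuous Batching idea mentioned above concrete, here is a toy Python sketch of the scheduling principle (not vLLM's actual implementation): finished sequences free their batch slot immediately, and waiting requests join the running batch between decode steps instead of waiting for the whole batch to drain. The function name and the token-count model are illustrative assumptions, not part of the talk or of vLLM's API.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching scheduler.

    `requests` maps request id -> number of tokens to generate.
    Each loop iteration is one decode step; slots freed by finished
    requests are refilled from the waiting queue before the next step.
    Returns (completion order, total decode steps)."""
    waiting = deque(requests.items())
    running = {}            # request id -> tokens still to generate
    completion_order = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        # One decode step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # slot freed mid-flight, no batch drain
                completion_order.append(rid)
        steps += 1
    return completion_order, steps
```

For example, with `{"a": 2, "b": 5, "c": 1, "d": 3, "e": 2}` and `max_batch=2`, short requests like `c` finish and are replaced while the long request `b` is still decoding, keeping both batch slots busy on every step.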

Syllabus

Keynote: Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on VLLM - Kante Yin

Taught by

CNCF [Cloud Native Computing Foundation]

