
How to Serve Big LLM over Decentralized GPUs - Parallax and Dynamic Programming

Yacine Mahdid via YouTube

Overview

Learn how to deploy large language model inference services on decentralized GPU networks using Parallax, a scheduling system that makes AI infrastructure more accessible and cost-effective. The lesson explores the technical challenges of serving big LLMs, which traditionally demand expensive high-end GPUs with high-bandwidth interconnects in specialized data centers, and shows how Parallax sidesteps those constraints by pooling heterogeneous GPUs distributed worldwide.

Master the system's two-stage scheduling approach, sketched in code below. In Phase 1, model layers are allocated with a water-filling method that distributes model components efficiently across GPUs of varying speed, memory, and network quality. In Phase 2, a pipeline chain is selected when a user request arrives, dynamically choosing the execution path that minimizes latency while maximizing throughput. A dynamic rebalancing mechanism then adapts placements in real time as network conditions and resource availability change.

Analyze performance results comparing latency and throughput against traditional centralized approaches, and examine scaling studies that demonstrate the system's effectiveness across different network sizes. Along the way, gain insight into the dynamic programming principles underlying Parallax's optimization algorithms, an instructive example of advanced algorithmic design applied to distributed systems. Supporting materials include the original academic paper and practical implementation guides for setting up your own decentralized LLM serving infrastructure.

Syllabus

- Introduction
- Parallax Overview
- Phase 1: Allocating Model Layers
- Phase 1: Water-Filling Method
- Phase 2: Pipeline Chain Selection
- "Phase 3": Dynamic Rebalancing
- Results Overview
- Latency and Throughput Comparison
- Scaling Study
- Conclusion

Taught by

Yacine Mahdid
