
Resource Multiplexing in Tuning and Serving Large Language Models

USENIX via YouTube

Overview

Learn about LLMStation, a novel spatial-temporal multiplexing and scheduling system designed to optimize GPU utilization for concurrent large language model fine-tuning and inference operations. Discover how this research addresses the common problem of GPU underutilization in single-task deployments while maintaining strict service-level objectives. Explore the system's innovative approaches including iteration-level multitasking scheduling mechanisms, an Autograd engine that transforms tuning tasks into suspendable pipelines, and an inference engine capable of batching both inference and tuning requests. Examine evaluation results demonstrating throughput improvements of 1.38× to 14.77× compared to state-of-the-art systems while meeting inference latency requirements across various setups and workloads. Understand the technical challenges of achieving high utilization in complex LLM workloads and the practical solutions implemented to increase deployment efficiency in production environments.
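The iteration-level multiplexing idea described above can be illustrated with a minimal sketch. This is a hypothetical toy, not LLMStation's actual implementation: the class name, methods, and placeholder "work" are all invented for illustration. The key point it demonstrates is that tuning proceeds one suspendable iteration at a time, and latency-sensitive inference requests are batched and served at iteration boundaries rather than waiting for the whole tuning job to finish.

```python
from collections import deque

class IterationLevelScheduler:
    """Toy sketch of iteration-level multiplexing (hypothetical names,
    not LLMStation's API): tuning yields at each iteration boundary so
    queued inference requests can be served within their SLOs."""

    def __init__(self):
        self.inference_queue = deque()
        self.log = []  # records the order in which work is executed

    def submit_inference(self, request_id):
        # Inference requests arrive asynchronously and wait in a queue.
        self.inference_queue.append(request_id)

    def run_inference_batch(self):
        # Drain all pending inference requests as one batch
        # (placeholder for real batched model execution).
        batch = list(self.inference_queue)
        self.inference_queue.clear()
        self.log.append(("infer", batch))

    def run_tuning_iteration(self, step):
        # One suspendable tuning step; a real system would save enough
        # pipeline state here to resume the step cleanly later.
        self.log.append(("tune", step))

    def run(self, tuning_steps):
        for step in range(tuning_steps):
            # SLO guard: serve any queued inference before the next
            # tuning iteration, so inference latency stays bounded by
            # roughly one iteration's duration.
            if self.inference_queue:
                self.run_inference_batch()
            self.run_tuning_iteration(step)
        if self.inference_queue:
            self.run_inference_batch()

sched = IterationLevelScheduler()
sched.submit_inference("r1")
sched.run(tuning_steps=2)
print(sched.log)  # inference batch served before tuning step 0
```

In the real system, tuning and inference requests can additionally be batched together on the GPU (the inference-engine capability mentioned above), whereas this sketch only interleaves them in time.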

Syllabus

USENIX ATC '25 - Resource Multiplexing in Tuning and Serving Large Language Models

Taught by

USENIX

