

CLONE - Customizing LLMs for Efficient Latency-Aware Inference at the Edge

USENIX via YouTube

Overview

Learn about CLONE, an algorithm-hardware co-design approach for deploying large language models on edge devices, through this 17-minute conference presentation from USENIX ATC '25. Discover how researchers from the University of Macau address the challenges of running LLMs on resource-constrained edge devices while balancing latency requirements, energy consumption, and model accuracy. Explore a solution that combines model-level and system-level optimizations with real-time energy optimization techniques while maintaining generality across applications. Examine the specialized 28nm scalable hardware accelerator system designed to maximize synergistic benefits in always-on and intermediate edge computing environments. Understand the implementation and evaluation results, which demonstrate up to 11.92× faster inference and up to 7.36× energy savings while preserving high-quality text generation on off-the-shelf edge platforms.

Syllabus

USENIX ATC '25 - CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge

Taught by

USENIX

