Overview
Learn about CLONE, an algorithm-hardware co-design approach for deploying large language models on edge devices, in this 17-minute conference presentation from USENIX ATC '25. Discover how researchers from the University of Macau address the challenges of running LLMs on resource-constrained edge devices while balancing latency requirements, energy consumption, and model accuracy. Explore how the solution combines model-level and system-level optimizations with real-time energy optimization techniques to maintain robust generality across applications. Examine the specialized 28nm scalable hardware accelerator system designed to maximize synergistic benefits in always-on and intermediate edge computing environments. Understand the implementation and evaluation results, which demonstrate up to 11.92× faster inference and up to 7.36× energy savings while preserving high-quality text generation on off-the-shelf edge platforms.
Syllabus
USENIX ATC '25 - CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge
Taught by
USENIX