Accelerating LLMs at the Edge - The Power of Efficient HW-SW Co-Design
EDGE AI FOUNDATION via YouTube
Overview
Learn how to accelerate Large Language Models (LLMs) on edge devices through efficient hardware-software co-design in this 26-minute conference talk. Discover a simulation-first co-design methodology (SECDA) integrated with llama.cpp that enables rapid iteration in minutes rather than days, allowing you to develop accelerators that deliver meaningful performance improvements.

Explore the key challenges blocking edge LLM deployment: high-level synthesis cycles that halt progress, memory-bound inference that doesn't benefit from additional CPU threads, and quantization formats that don't map efficiently to general-purpose cores. Understand how llama.cpp, the GGUF format, and deep quantization techniques enable compact models across diverse hardware platforms. Examine the SECDA-LLM toolkit, which offloads critical kernels through a GGML backend, enabling custom FPGA operator prototyping while maintaining a clean, portable code architecture.

Analyze two practical implementations. The first is a format-aware matrix multiplication engine for TinyLlama that handles packed weights, applies block and superblock scalars, and optimizes tile scheduling for maximum reuse, achieving up to 11x latency reduction on ARM+FPGA boards compared to CPU-only execution. The second is a dynamic superblock processor for mixed block floating point operations across layers, supporting formats such as Q3K and Q2 simultaneously by running scale paths in parallel and performing late selection to eliminate inner-loop branches.

Follow a roadmap for future development, including broader BFP support for 4-6 bits, emerging attention variants, shift-based arithmetic for cost reduction, and sparsity integration into dataflow architectures. Master the simulation-first workflow of simulate, measure, refine, then synthesize, transforming edge LLM acceleration from a complex engineering challenge into manageable development cycles for low-latency, private inference on resource-constrained devices.
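To make the "block and superblock scalars" idea concrete, here is a rough software analogue of two-level block quantization. The block sizes, the 8-bit block-scale encoding, and the function names below are illustrative assumptions for this sketch, not the actual GGUF Q-format bit layout used by llama.cpp:

```python
import numpy as np

def quantize_superblock(w, block=16, blocks_per_super=4, bits=3):
    """Quantize a 1-D float array to signed low-bit ints, with each
    per-block scale stored as an 8-bit multiple of a shared per-superblock
    scale. A simplified illustration of block/superblock scaling, NOT the
    real GGUF bit layout."""
    qmax = 2 ** (bits - 1) - 1                          # 3 for 3-bit symmetric
    w = np.asarray(w, dtype=np.float64).reshape(-1, blocks_per_super, block)
    raw = np.abs(w).max(axis=2)                         # per-block abs max
    d_super = raw.max(axis=1, keepdims=True) / 255.0    # one float scale per superblock
    d_super = np.where(d_super == 0.0, 1.0, d_super)    # guard all-zero input
    d_block = np.round(raw / d_super).astype(np.uint8)  # 8-bit block scales
    scale = (d_block * d_super)[..., None] / qmax       # effective per-block scale
    safe = np.where(scale == 0.0, 1.0, scale)           # avoid divide-by-zero
    q = np.clip(np.round(w / safe), -qmax, qmax).astype(np.int8)
    return q, d_block, d_super

def dequantize_superblock(q, d_block, d_super, bits=3):
    """Reconstruct approximate weights: q * (block_scale * superblock_scale)."""
    qmax = 2 ** (bits - 1) - 1
    return (q * (d_block * d_super)[..., None] / qmax).reshape(-1)
```

A format-aware matmul engine like the one described in the talk would consume the packed integers and both scale levels directly inside its tiles, rather than dequantizing the whole weight matrix to floats first.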
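The dynamic superblock processor avoids per-element branching by computing both scale paths and selecting the result late. A minimal software sketch of that idea (the shapes, names, and one-format-flag-per-block layout are my own assumptions, not from the talk):

```python
import numpy as np

def dequant_branchy(q, d3, d2, fmt):
    """Baseline: branch on the format flag inside the loop."""
    out = np.empty(q.shape, dtype=np.float64)
    for b in range(q.shape[0]):
        if fmt[b] == 0:           # Q3-style block
            out[b] = q[b] * d3[b]
        else:                     # Q2-style block
            out[b] = q[b] * d2[b]
    return out

def dequant_late_select(q, d3, d2, fmt):
    """Run both scale paths in parallel, then select once per block.
    In hardware, both datapaths run every cycle and a mux picks the
    result, so the inner loop carries no data-dependent branch."""
    path3 = q * d3[:, None]
    path2 = q * d2[:, None]
    return np.where((fmt == 0)[:, None], path3, path2)
```

Both functions produce identical results; the second trades redundant arithmetic for a branch-free inner loop, which is the favorable trade on an FPGA datapath.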
Syllabus
Accelerating LLMs at the Edge: The Power of Efficient HW-SW Co-Design
Taught by
EDGE AI FOUNDATION