Accelerating LLMs at the Edge - The Power of Efficient HW-SW Co-Design
EDGE AI FOUNDATION via YouTube
Overview
Learn how to accelerate Large Language Models (LLMs) on edge devices through efficient hardware-software co-design in this 26-minute conference talk. Discover a simulation-first co-design methodology (SECDA) integrated with llama.cpp that enables rapid iteration in minutes rather than days, allowing you to develop accelerators that deliver meaningful performance improvements.

Explore the key challenges blocking edge LLM deployment: high-level synthesis cycles that halt progress, memory-bound inference that doesn't benefit from additional CPU threads, and quantization formats that don't map efficiently to general-purpose cores. Understand how llama.cpp, the GGUF format, and deep quantization techniques enable compact models across diverse hardware platforms. Examine the SECDA-LLM toolkit, which offloads critical kernels through a GGML backend, enabling custom FPGA operator prototyping while maintaining a clean, portable code architecture.

Analyze two practical implementations. The first is a format-aware matrix multiplication engine for TinyLlama that handles packed weights, applies block and superblock scalars, and optimizes tile scheduling for maximum reuse, achieving up to 11x latency reduction on ARM+FPGA boards compared to CPU-only execution. The second is a dynamic superblock processor for mixed block floating point operations across layers, supporting formats such as Q3K and Q2 simultaneously by running scale paths in parallel and performing late selection to eliminate inner-loop branches.

Follow a roadmap for future development, including broader BFP support for 4-6 bits, emerging attention variants, shift-based arithmetic for cost reduction, and sparsity integration into dataflow architectures. Master the simulation-first workflow of simulate, measure, refine, then synthesize, transforming edge LLM acceleration from a complex engineering challenge into manageable development cycles for low-latency, private inference on resource-constrained devices.
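To make the "block and superblock scalars" idea concrete, here is a rough software analogue of two-level block quantization. The block sizes, the 8-bit block-scale encoding, and the function names below are illustrative assumptions for this sketch, not the actual GGUF Q-format bit layout used by llama.cpp:

```python
import numpy as np

def quantize_superblock(w, block=16, blocks_per_super=4, bits=3):
    """Quantize a 1-D float array to signed low-bit ints, with each
    per-block scale stored as an 8-bit multiple of a shared per-superblock
    scale. A simplified illustration of block/superblock scaling, NOT the
    real GGUF bit layout."""
    qmax = 2 ** (bits - 1) - 1                          # 3 for 3-bit symmetric
    w = np.asarray(w, dtype=np.float64).reshape(-1, blocks_per_super, block)
    raw = np.abs(w).max(axis=2)                         # per-block abs max
    d_super = raw.max(axis=1, keepdims=True) / 255.0    # one float scale per superblock
    d_super = np.where(d_super == 0.0, 1.0, d_super)    # guard all-zero input
    d_block = np.round(raw / d_super).astype(np.uint8)  # 8-bit block scales
    scale = (d_block * d_super)[..., None] / qmax       # effective per-block scale
    safe = np.where(scale == 0.0, 1.0, scale)           # avoid divide-by-zero
    q = np.clip(np.round(w / safe), -qmax, qmax).astype(np.int8)
    return q, d_block, d_super

def dequantize_superblock(q, d_block, d_super, bits=3):
    """Reconstruct approximate weights: q * (block_scale * superblock_scale)."""
    qmax = 2 ** (bits - 1) - 1
    return (q * (d_block * d_super)[..., None] / qmax).reshape(-1)
```

A format-aware matmul engine like the one described in the talk would consume the packed integers and both scale levels directly inside its tiles, rather than dequantizing the whole weight matrix to floats first.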
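The dynamic superblock processor avoids per-element branching by computing both scale paths and selecting the result late. A minimal software sketch of that idea (the shapes, names, and one-format-flag-per-block layout are my own assumptions, not from the talk):

```python
import numpy as np

def dequant_branchy(q, d3, d2, fmt):
    """Baseline: branch on the format flag inside the loop."""
    out = np.empty(q.shape, dtype=np.float64)
    for b in range(q.shape[0]):
        if fmt[b] == 0:           # Q3-style block
            out[b] = q[b] * d3[b]
        else:                     # Q2-style block
            out[b] = q[b] * d2[b]
    return out

def dequant_late_select(q, d3, d2, fmt):
    """Run both scale paths in parallel, then select once per block.
    In hardware, both datapaths run every cycle and a mux picks the
    result, so the inner loop carries no data-dependent branch."""
    path3 = q * d3[:, None]
    path2 = q * d2[:, None]
    return np.where((fmt == 0)[:, None], path3, path2)
```

Both functions produce identical results; the second trades redundant arithmetic for a branch-free inner loop, which is the favorable trade on an FPGA datapath.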
Syllabus
Accelerating LLMs at the Edge: The Power of Efficient HW-SW Co-Design
Taught by
EDGE AI FOUNDATION