Designing the Next-Generation Foundation Model Architecture for Edge AI
EDGE AI FOUNDATION via YouTube
Overview
Explore how to build foundation models specifically optimized for edge devices through hardware-aware architecture design in this 18-minute conference talk. Learn about the development of LFM2 (Liquid Foundation Models), which prioritizes real-world performance constraints like memory bandwidth, KV cache limitations, and decode latency on smartphones, laptops, and edge hardware. Discover the STAR evolutionary search methodology that combines attention mechanisms with state space models, linear attention, and convolutional components to create modular architectures tailored for different device classes.

Understand the hardware-in-the-loop evaluation process that measures actual performance on devices like the Galaxy S24, balancing inference speed against model quality on benchmarks including MMLU and multilingual assessments. Examine the practical results showing LFM2's superior decode performance compared to larger transformer models such as Llama variants, with open-weight releases ranging from 350M to 2.6B parameters plus MoE variants.

See how this efficient backbone extends beyond text to vision-language models for on-device video understanding, and to native audio processing that interleaves text and audio tokens to reduce the latency of the traditional speech-to-text-to-LLM-to-speech pipeline. Gain insights into deploying AI models locally for privacy and low-latency applications across phones and PCs, with practical integration examples using llama.cpp and NPU implementations for Qualcomm, with AMD support upcoming.
Syllabus
Designing the Next-Generation Foundation Model Architecture for Edge AI
Taught by
EDGE AI FOUNDATION