Deploying GenAI: Overcoming Challenges in Performance, Security, and Efficiency

In this 25-minute conference talk, learn how to deploy generative AI models directly on resource-constrained devices to ensure autonomy, security, and real-time performance. Explore systematic approaches for running Large Language Models (LLMs) and other transformer models on autonomous vehicles, drones, and IoT devices without relying on cloud infrastructure. Examine state-of-the-art systems-on-chip (SoCs) from leading manufacturers and understand their capabilities and limitations for AI workloads. Discover essential model compression techniques, including quantization, pruning, and knowledge distillation, with practical insights from a real-world case study showing how Small Language Models (SLMs) such as Meta's Llama 3.2 can run efficiently on Qualcomm Snapdragon SoCs. Master the engineering techniques needed to evaluate hardware accelerators, apply compression methods without sacrificing model capabilities, balance model size against efficacy, and leverage emerging SLM trends to future-proof applications. The talk is presented by Jonna Matthiesen, a deep learning researcher at Embedl specializing in AI optimization for defense, automotive, and IoT applications, and was recorded at the 2025 GAIA Conference in Gothenburg, Sweden.