Audio-Visual Active Speaker Detection on Embedded Devices - Optimizing Deep Learning Models

Watch a technical conference presentation exploring the development and optimization of Active Speaker Detection (ASD) models for embedded devices. Learn how researchers at NXP Semiconductors created computationally efficient deep learning architectures that identify active speakers in video by analyzing both visual and audio features in real time. Discover the approaches used to drastically reduce computational cost through multi-objective optimization and a novel modality fusion scheme, enabling implementation on both high-end MPUs and resource-constrained MCUs. Follow the complete optimization journey from model design modifications to quantization and integration on NXP devices, with detailed analysis of the trade-offs between computational efficiency and system accuracy.