Neural Target Speech and Sound Extraction

Explore neural target speech and sound extraction in this plenary lecture delivered at JSALT 2025. The lecture starts from the cocktail party effect, the selective hearing ability that lets humans isolate a desired sound in a complex acoustic environment, such as following a conversation in a noisy café or picking out a specific instrument in a piece of music. Learn about target speech/sound extraction (TSE), in which a neural network isolates a target speaker or sound from an audio mixture using an identifying clue such as spatial information, visual cues from video, or an enrollment audio sample. Discover the foundational principles of TSE and examine recent research on neural approaches to extracting both speech and arbitrary sounds. The lecture is given by Marc Delcroix, Distinguished Researcher at NTT Communication Science Laboratories, whose expertise spans speech enhancement, robust speech recognition, model adaptation, and speaker diarization, and who has contributed significantly to major challenges and conferences in the field, including CHiME, REVERB, ASRU, and SLT.