Robust Speech Recognition I - Day 7 Morning
Center for Language & Speech Processing (CLSP), JHU via YouTube
Overview
Learn robust speech recognition techniques for real-world conversational scenarios in this comprehensive tutorial from the JSALT Summer School 2025. Explore the significant challenges facing automatic speech recognition (ASR) systems when transitioning from clean laboratory conditions to practical applications like meeting transcription, where word error rates can exceed 35% compared to under 3% on clean data. Examine the core obstacles, including background noise, reverberation, multiple simultaneous speakers, and overlapping speech, which occurs in over 15% of meeting duration.

Master evaluation methodologies for long-form multi-speaker audio, including the concatenated minimum-permutation word error rate (cpWER), and survey essential datasets ranging from AMI to current benchmarks like CHiME-7/8 and NOTSOFAR1. Discover technical approaches categorized into front-end methods, such as speech separation, beamforming, and target-speaker extraction, and back-end methods, including self-supervised features, serialized output training, and target-speaker ASR.

Understand how large language models are enabling new applications like automated meeting summarization while creating fresh research opportunities. Address key challenges in speaker tracking, training-inference mismatches, and the integration of speech separation, diarization, and recognition components in this active research field with significant potential for advancement.
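The cpWER metric mentioned above can be illustrated with a short sketch: each speaker's reference and hypothesis utterances are concatenated into one stream per speaker, every assignment of hypothesis streams to reference streams is scored, and the permutation with the fewest total word errors determines the reported WER. The function names `cpwer` and `edit_distance` below are illustrative, not taken from the lecture, and the brute-force search over permutations is only practical for small speaker counts.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    # Word-level Levenshtein distance using a single rolling DP row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cpwer(refs, hyps):
    """Concatenated minimum-permutation WER (illustrative sketch).

    refs, hyps: lists of per-speaker transcripts, each speaker's
    utterances already concatenated into one string.
    """
    # Pad with empty streams so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    ref_words = [r.split() for r in refs] + [[]] * (n - len(refs))
    hyp_words = [h.split() for h in hyps] + [[]] * (n - len(hyps))
    total_ref = sum(len(r) for r in ref_words)
    # Try every speaker assignment; keep the lowest total error count.
    best = min(
        sum(edit_distance(ref_words[i], hyp_words[p]) for i, p in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / total_ref
```

For example, if the hypothesis streams are correct but attributed to swapped speakers, `cpwer(["hello world", "good morning"], ["good morning", "hello world"])` returns 0.0, because the minimum-permutation matching forgives the stream ordering while still penalizing word errors within each matched stream.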
Syllabus
[camera] Day 7 morning - JSALT 2025 - Burget, Cornell, Masuyama: Robust speech recognition I.
Taught by
Center for Language & Speech Processing (CLSP), JHU