Align Audio and Text for Speech Recognition Model Training
Classroom Contents
- 1 Introduction to audio-text alignment for model training
- 2 Demo of Audio Alignment interface with two models
- 3 MMS-FA model and commercially licensed CTC aligner alternative
- 4 Word timestamps enable sentence detection for clean training chunks
- 5 Multi-step alignment process: normalization, emissions, and character probabilities
- 6 Viterbi process calculates most likely path for final alignment
- 7 Trelis Studio data preparation workflow with audio upload
- 8 Realignment process creates clean 20-30 second chunks with sentence boundaries
- 9 Review of resulting dataset with clean chunks and word timestamps
- 10 Fine-tuning Whisper without timestamps causes catastrophic forgetting
- 11 Emissions are character probabilities generated per audio frame
- 12 wav2vec models and text normalization process
- 13 TorchAudio forced aligner non-commercial license restriction
- 14 Viterbi method for mapping ground truth text to audio windows
- 15 Emissions model training using unlabeled data approach
- 16 Multiple valid sequences kept during alignment process
- 17 wav2vec pre-training uses masked audio prediction
- 18 Conclusion with repository reference at Trelis.com/advanced-audio
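
Chapters 5, 6, 11, and 14 describe a pipeline of text normalization, per-frame character probabilities (emissions), and a Viterbi search that maps the ground-truth text onto audio frames. The following is a minimal sketch of that idea under stated assumptions, not the code used in the video: the function names `normalize` and `viterbi_align` are hypothetical, the emissions are hand-written toy values rather than model output, and the CTC blank symbol that real aligners handle is left out for brevity.

```python
import math


def normalize(text):
    """Lowercase and keep only letters and spaces (an assumed, simplified rule)."""
    kept = [c for c in text.lower() if c.isalpha() or c == " "]
    return " ".join("".join(kept).split())


def viterbi_align(log_probs, tokens):
    """Assign every frame to one token, in order, maximizing total log-probability.

    log_probs: one dict per audio frame, mapping token -> log-probability.
    tokens: the ground-truth token sequence (here, characters).
    Returns a list of (token, first_frame, last_frame) spans.
    """
    T, N = len(log_probs), len(tokens)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per token")
    NEG = -math.inf
    # score[t][j]: best score with frame t assigned to token j
    score = [[NEG] * N for _ in range(T)]
    # advanced[t][j]: True if the best path moved from token j-1 to j at frame t
    advanced = [[False] * N for _ in range(T)]
    score[0][0] = log_probs[0][tokens[0]]
    for t in range(1, T):
        for j in range(min(t + 1, N)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            if move > stay:
                advanced[t][j] = True
            score[t][j] = max(stay, move) + log_probs[t][tokens[j]]
    # Backtrack from the last frame and last token to recover the path.
    frame_to_token = [0] * T
    j = N - 1
    for t in range(T - 1, -1, -1):
        frame_to_token[t] = j
        if advanced[t][j]:
            j -= 1
    # Merge consecutive frames that share a token into spans.
    spans = []
    for t, j in enumerate(frame_to_token):
        if spans and spans[-1][0] == j:
            spans[-1][2] = t
        else:
            spans.append([j, t, t])
    return [(tokens[j], first, last) for j, first, last in spans]


# Toy emissions for four frames over a two-character vocabulary (made-up values).
frames = [
    {"h": math.log(0.9), "i": math.log(0.1)},
    {"h": math.log(0.8), "i": math.log(0.2)},
    {"h": math.log(0.3), "i": math.log(0.7)},
    {"h": math.log(0.1), "i": math.log(0.9)},
]
tokens = list(normalize("Hi!"))
spans = viterbi_align(frames, tokens)
print(spans)
```

Multiplying the frame indexes in each span by the emission model's frame duration would turn them into timestamps; grouping character spans by word would then give the word timestamps that chapter 4 uses for sentence detection.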