Overview
Syllabus
Introduction to audio-text alignment for model training
Demo of Audio Alignment interface with two models
MMS-FA model and commercially licensed CTC aligner alternative
Word timestamps enable sentence detection for clean training chunks
Multi-step alignment process: normalization, emissions, and character probabilities
Viterbi algorithm calculates most likely path for final alignment
Trelis Studio data preparation workflow with audio upload
Realignment process creates clean 20-30 second chunks with sentence boundaries
Review of resulting dataset with clean chunks and word timestamps
Fine-tuning Whisper without timestamps causes catastrophic forgetting
Emissions are character probabilities generated per audio frame
Wav2Vec2 models and text normalization process
TorchAudio forced aligner non-commercial license restriction
Viterbi method for mapping ground truth text to audio windows
Emissions model training on unlabeled data
Multiple valid sequences kept during alignment process
Wav2Vec2 pre-training uses masked audio prediction
Conclusion with repository reference at Trelis.com/advanced-audio
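The emissions-and-Viterbi steps listed above can be sketched in miniature. The snippet below is a toy illustration of CTC-style forced alignment, not the MMS-FA or TorchAudio implementation: the function name, the token ids, and the hand-written emissions matrix are all invented for the example. It takes per-frame character log-probabilities (the "emissions") and uses Viterbi dynamic programming to map each audio frame to a token of the ground-truth text, which is where word timestamps come from.

```python
import math

def ctc_forced_align(log_probs, targets, blank=0):
    """Toy Viterbi forced alignment: assign each frame a target token.
    log_probs: T x V list of per-frame log character probabilities (emissions).
    targets:   token ids of the ground-truth transcript.
    Returns the most likely token id per frame (blank = silence/transition)."""
    # CTC trick: interleave blanks so silence can occur between characters.
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S, T = len(ext), len(log_probs)
    NEG = -math.inf
    dp = [[NEG] * S for _ in range(T)]      # best log score ending in state s at frame t
    back = [[0] * S for _ in range(T)]      # backpointers for path recovery
    dp[0][0] = log_probs[0][ext[0]]
    if S > 1:
        dp[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay in s, advance from s-1, or skip a blank from s-2
            # (skipping is only legal between two different non-blank characters).
            cands = [(dp[t - 1][s], s)]
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            dp[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = prev
    # The path must end in the last character or the trailing blank.
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path
```

With ids blank=0, 'a'=1, 'b'=2 and four frames whose emissions favor "a a b b", aligning the transcript [1, 2] yields one frame span per character, from which start/end timestamps fall out directly.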
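The chunking step described above (clean 20-30 second chunks that end on sentence boundaries) can also be sketched. The helper below is hypothetical, not the Trelis Studio workflow: it assumes the aligner has already produced (word, start_sec, end_sec) tuples, and it greedily closes a chunk at the first sentence-ending word once the chunk is long enough, or force-closes at the maximum length.

```python
def chunk_on_sentences(words, min_len=20.0, max_len=30.0):
    """Group aligned words into training chunks of roughly min_len..max_len
    seconds, preferring to cut at sentence boundaries (., !, ?).
    words: list of (word, start_sec, end_sec) tuples from a forced aligner."""
    chunks, current = [], []
    for word, start, end in words:
        current.append((word, start, end))
        duration = end - current[0][1]
        ends_sentence = word.endswith((".", "!", "?"))
        # Close the chunk at a sentence boundary once it is long enough,
        # or unconditionally once it would exceed the maximum length.
        if (ends_sentence and duration >= min_len) or duration >= max_len:
            chunks.append({
                "text": " ".join(w for w, _, _ in current),
                "start": current[0][1],
                "end": current[-1][2],
            })
            current = []
    if current:  # flush trailing words that never hit a qualifying boundary
        chunks.append({
            "text": " ".join(w for w, _, _ in current),
            "start": current[0][1],
            "end": current[-1][2],
        })
    return chunks
```

Each resulting dict carries the chunk text plus its audio start/end times, which is the shape of dataset row the syllabus describes for fine-tuning.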
Taught by
Trelis Research