
Whisper Data Preparation and Fine-Tuning with Unsloth

Trelis Research via YouTube

Overview

Learn to prepare audio data and fine-tune OpenAI's Whisper speech recognition model using Unsloth's optimization framework in this comprehensive 41-minute tutorial. Set up a one-click GPU environment with Jupyter notebooks, explore the differences between the Whisper, Voxtral, and Kyutai models, and decide between the Whisper Large and Turbo variants. Master the installation of Unsloth and Whisper Timestamped, then dive into practical audio data preparation: recording, transcription, and the critical differences between standard Whisper and Whisper-Timestamped for obtaining word-level timestamps.

Discover how to create precise text and audio segments from word-timestamped transcripts, understand why Whisper's segment-level timestamps make it hard to chunk audio to under 30 seconds, and apply both automated and manual transcript cleanup techniques to improve data quality. Build datasets from the processed audio and text segments, then fine-tune with Unsloth's efficient training methods.

Finally, evaluate model performance using the Word Error Rate metric, comparing teacher forcing against predict_with_generate, and analyze training hyperparameters, losses, and results. Compare the base model against your fine-tuned version, then merge the model, push it to the Hugging Face Hub, and prepare it for inference deployment.
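The chunking step described above must keep each training segment inside Whisper's 30-second context window, which is why word-level timestamps matter. As an illustrative sketch (not code from the video), assuming word entries shaped like whisper-timestamped output with `text`, `start`, and `end` keys, a greedy grouping might look like this:

```python
def chunk_words(words, max_len=30.0):
    """Greedily group word-level timestamp entries into segments under max_len seconds.

    words: list of dicts with "text", "start", "end" (seconds), in time order.
    Returns a list of segment dicts with joined text and segment boundaries.
    """
    groups, current = [], []
    for w in words:
        # Start a new segment if adding this word would exceed the window
        if current and w["end"] - current[0]["start"] > max_len:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return [
        {
            "text": " ".join(w["text"] for w in g),
            "start": g[0]["start"],
            "end": g[-1]["end"],
        }
        for g in groups
    ]
```

The segment boundaries returned here would then be used to slice the audio file itself, so each text segment is paired with a matching audio clip under 30 seconds.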

Syllabus

0:00 Whisper preparation and fine-tuning with Unsloth
0:40 Resources: Trelis.com/ADVANCED-audio
1:23 One-click GPU and Jupyter Notebook Setup
3:37 Whisper vs Voxtral vs Kyutai
4:48 Installation of Unsloth and Whisper Timestamped
7:52 Using Whisper Large versus Turbo
8:53 Video Overview / Layout - How to prepare data and train
11:33 Audio recording and transcription with whisper timestamped
13:06 Whisper vs Whisper-Timestamped and the motivation for word timestamps
15:34 Creating text/audio segments using word-timestamped transcripts
18:46 Segment timestamps from Whisper are not easy to chunk to under 30 seconds
19:50 Word time-stamps with Whisper Timestamped
20:56 Automated vs manual transcript cleanup techniques
28:48 Dataset creation from audio and text segments
30:53 Fine-tuning with Unsloth
33:26 Word Error Rate - Teacher Force versus predict_with_generate
36:11 Training hyperparameters and losses / results
37:33 Evaluating base and fine-tuned model performance
39:15 Merging, pushing to the Hub, and preparing for inference (see also https://www.youtube.com/watch?v=qXtPPgujufI)
40:21 Conclusion
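The Word Error Rate metric covered at 33:26 is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal pure-Python sketch for illustration (the video likely uses a library such as `jiwer` or Hugging Face `evaluate` instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting all reference words up to i
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting all hypothesis words up to j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,      # deletion
                d[i][j - 1] + 1,      # insertion
                d[i - 1][j - 1] + sub # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER computed with teacher forcing (scoring the model's next-token predictions against the reference) can differ from WER on free-running generation (`predict_with_generate`), which is the distinction the video draws.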

Taught by

Trelis Research

