
YouTube

Transcription Models Zero to Hero - Data Prep, Train and Serve

Trelis Research via YouTube

Overview

Learn to build, train, and deploy custom transcription models from scratch in this 19-minute tutorial covering the complete pipeline from data preparation to model serving. Discover the essential data requirements, audio recordings paired with transcripts, then upload audio-text pairs and prepare datasets for training. Explore techniques such as saving word swaps and using an existing model to generate better training transcripts, and heed the warning that clean text is out-of-distribution for smaller models.

Set up your development environment by configuring Hugging Face tokens and Weights & Biases keys, then create a robust validation set by using ChatGPT to rephrase text for better evaluation. Configure training settings and advanced parameters, starting from a baseline word error rate of 7.09%. Monitor training as loss and word error rate fall, note occasional high gradient norms, and see how models and logs are pushed automatically to the Hugging Face Hub. Analyze the evaluation results and examine specific corrections made by the fine-tuned model, including spelling improvements as well as some regressions.

Finally, deploy the model to an endpoint with a keep-warm feature, understand auto-sleep containers and API key access options, test the deployed endpoint, and explore the available transcript download formats. The tutorial concludes with a look at the evaluation tab and plans for future text-to-speech integration.
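The baseline metric quoted above, a 7.09% word error rate, is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. As a rough illustration of what that number measures (a minimal standard-library sketch, not the evaluation code used in the video):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to match an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions to build hyp from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,      # deletion
                d[i][j - 1] + 1,      # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution out of four reference words -> 25% WER.
print(word_error_rate("the quick brown fox", "the quick brown dog"))  # 0.25
```

In practice a library such as `jiwer` (with text normalization for casing and punctuation) is the usual choice; the function above only shows the core computation.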

Syllabus

Introduction and overview of pipeline
Data requirements: audio recordings and transcripts
Uploading audio-text pairs and dataset preparation
Saving word swaps and transcribing with model for better training
Warning about clean text being out-of-distribution for small models
Setting up Hugging Face token and Weights & Biases key
Creating validation set using ChatGPT to rephrase text
Configuring training settings and advanced parameters
Baseline evaluation shows 7.09% word error rate
Training begins with falling loss and word error rate
Model training progress and high grad norm observation
Model and logs pushed to Hugging Face Hub
Inspecting evaluation results and specific corrections
Spelling improvements and regressions in fine-tuned model
Deploying model to endpoint with keep warm feature
Auto-sleep containers and API key access options
Testing endpoint and transcript download formats
Evaluation tab features and future text-to-speech plans
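Among the data-prep steps above, "saving word swaps" refers to correcting terms the base model repeatedly mis-transcribes (names, jargon) before using its output as training text. The video does not show the platform's implementation; a minimal whole-word-replacement sketch of the idea, with an assumed swap mapping:

```python
import re


def apply_word_swaps(transcript: str, swaps: dict[str, str]) -> str:
    """Replace whole words in a transcript using a saved swap mapping.

    `swaps` is a hypothetical dictionary of recurring mis-transcriptions,
    e.g. {"Trellis": "Trelis"}. Word boundaries (\b) prevent partial-word
    replacements inside longer tokens.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(1)], transcript)


# Example: fix a recurring misspelling of the channel name.
fixed = apply_word_swaps("Trellis Research trains models", {"Trellis": "Trelis"})
print(fixed)  # Trelis Research trains models
```

Applying the same swaps consistently across the whole dataset matters more than the mechanism: the fine-tuned model learns whichever spellings the training transcripts contain.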

Taught by

Trelis Research

