
YouTube

Building Awesome Speech-to-Text Transformers from Scratch - One Line of PyTorch at a Time

Neural Breakdown with AVB via YouTube

Overview

Learn to build a speech-to-text (STT) audio transcription model from scratch in this 53-minute tutorial by Neural Breakdown with AVB. Walk through the complete pipeline implementation in PyTorch, including 1D convolutional layers for raw waveform processing, transformer-style self-attention layers, residual vector quantization (RVQ) for efficient representation, and CTC loss for sequence alignment. The tutorial suits both beginners in speech recognition and those wanting to understand the internals of models like wav2vec 2.0. It covers audio dataset structure, text tokenization, data preprocessing, network architecture, convolutional blocks, attention mechanisms, transformers, and training, with a hands-on approach that avoids simply calling pre-built libraries.
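To give a flavor of the sequence-alignment piece: CTC lets the model emit one token per audio frame, with repeats and a special blank token, and then collapses those frames into the final transcript. A minimal sketch of the greedy CTC collapse rule (token ids and the blank id here are illustrative assumptions, not taken from the video):

```python
# Greedy CTC decoding sketch: collapse repeated tokens, then drop blanks.
# BLANK = 0 is an assumed convention, not necessarily the tutorial's.
BLANK = 0

def ctc_greedy_decode(frame_ids):
    """Collapse per-frame argmax token ids into an output token sequence."""
    out = []
    prev = None
    for t in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# Frames [3, 3, 0, 1, 1, 0, 0, 1, 20] collapse to [3, 1, 1, 20]:
# the blank between the two runs of 1 keeps them as separate tokens.
print(ctc_greedy_decode([3, 3, 0, 1, 1, 0, 0, 1, 20]))
```

In PyTorch, the training-side counterpart of this rule is `torch.nn.CTCLoss`, which sums over all frame alignments that collapse to the target text.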

Syllabus

0:00 - Intro
0:36 - What audio datasets look like
4:30 - Tokenizing text
9:34 - Data Preprocessing
11:38 - MFCCs and Encoder-Decoder networks
14:20 - Network Architecture
17:59 - Coding the Convolutional Block
26:40 - Coding attention and Transformers
30:20 - Residual Vector Quantizers
32:57 - Coding RVQs
37:44 - Optimizing RVQs
43:50 - Putting it together
48:50 - Connectionist Temporal Classification (CTC) Loss
50:53 - Training!
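The residual vector quantization segments (30:20-43:50) center on one idea: quantize an embedding in stages, where each codebook encodes the residual the previous stages left behind. A minimal NumPy sketch of that idea, with randomly initialized codebooks and dimensions chosen purely for illustration (not the tutorial's actual architecture):

```python
import numpy as np

# Residual vector quantization (RVQ) sketch.
# Sizes and random codebooks are illustrative assumptions.
rng = np.random.default_rng(0)
dim, codebook_size, n_stages = 8, 16, 3
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the current residual."""
    quantized = np.zeros_like(x)
    codes = []
    residual = x.copy()
    for cb in codebooks:
        # Pick the codeword nearest to the remaining residual.
        idx = int(((residual[None, :] - cb) ** 2).sum(axis=1).argmin())
        codes.append(idx)
        quantized += cb[idx]
        residual = x - quantized
    return codes, quantized

x = rng.normal(size=dim)
codes, q = rvq_encode(x, codebooks)
# `codes` holds one index per stage; with trained codebooks, each added
# stage typically shrinks the reconstruction error ||x - q||.
```

The payoff the tutorial optimizes for is compactness: a vector is stored as a few small integer indices rather than raw floats, while later stages progressively refine the reconstruction.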

Taught by

Neural Breakdown with AVB

