
YouTube

Building Awesome Speech-to-Text Transformers from Scratch - One Line of PyTorch at a Time

Neural Breakdown with AVB via YouTube

Overview

Learn to build a speech-to-text (STT) audio transcription model from scratch in this 53-minute tutorial by Neural Breakdown with AVB. Walk through the complete pipeline implementation in PyTorch, including 1D convolutional layers for raw waveform processing, transformer-style self-attention layers, residual vector quantization (RVQ) for efficient representation, and CTC loss for sequence alignment. The tutorial suits both beginners in speech recognition and those wanting to understand the internals of models like wav2vec 2.0. It covers audio dataset structure, text tokenization, data preprocessing, network architecture, convolutional blocks, attention mechanisms, transformers, and training, with a hands-on approach that avoids simply calling pre-built libraries.
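To give a flavor of the sequence-alignment piece: CTC lets the model emit one token per audio frame, with repeats and a special blank token, and then collapses those frames into the final transcript. A minimal sketch of the greedy CTC collapse rule (token ids and the blank id here are illustrative assumptions, not taken from the video):

```python
# Greedy CTC decoding sketch: collapse repeated tokens, then drop blanks.
# BLANK = 0 is an assumed convention, not necessarily the tutorial's.
BLANK = 0

def ctc_greedy_decode(frame_ids):
    """Collapse per-frame argmax token ids into an output token sequence."""
    out = []
    prev = None
    for t in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# Frames [3, 3, 0, 1, 1, 0, 0, 1, 20] collapse to [3, 1, 1, 20]:
# the blank between the two runs of 1 keeps them as separate tokens.
print(ctc_greedy_decode([3, 3, 0, 1, 1, 0, 0, 1, 20]))
```

In PyTorch, the training-side counterpart of this rule is `torch.nn.CTCLoss`, which sums over all frame alignments that collapse to the target text.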

Syllabus

0:00 - Intro
0:36 - What audio datasets look like
4:30 - Tokenizing text
9:34 - Data Preprocessing
11:38 - MFCCs and Encoder-Decoder networks
14:20 - Network Architecture
17:59 - Coding the Convolutional Block
26:40 - Coding attention and Transformers
30:20 - Residual Vector Quantizers
32:57 - Coding RVQs
37:44 - Optimizing RVQs
43:50 - Putting it together
48:50 - Connectionist Temporal Classification (CTC) Loss
50:53 - Training!
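The residual vector quantization segments (30:20-43:50) center on one idea: quantize an embedding in stages, where each codebook encodes the residual the previous stages left behind. A minimal NumPy sketch of that idea, with randomly initialized codebooks and dimensions chosen purely for illustration (not the tutorial's actual architecture):

```python
import numpy as np

# Residual vector quantization (RVQ) sketch.
# Sizes and random codebooks are illustrative assumptions.
rng = np.random.default_rng(0)
dim, codebook_size, n_stages = 8, 16, 3
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the current residual."""
    quantized = np.zeros_like(x)
    codes = []
    residual = x.copy()
    for cb in codebooks:
        # Pick the codeword nearest to the remaining residual.
        idx = int(((residual[None, :] - cb) ** 2).sum(axis=1).argmin())
        codes.append(idx)
        quantized += cb[idx]
        residual = x - quantized
    return codes, quantized

x = rng.normal(size=dim)
codes, q = rvq_encode(x, codebooks)
# `codes` holds one index per stage; with trained codebooks, each added
# stage typically shrinks the reconstruction error ||x - q||.
```

The payoff the tutorial optimizes for is compactness: a vector is stored as a few small integer indices rather than raw floats, while later stages progressively refine the reconstruction.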

Taught by

Neural Breakdown with AVB

