End-to-End Adversarial Text-to-Speech - Paper Explained

This 41-minute video lecture analyzes the paper "End-to-End Adversarial Text-to-Speech". It covers the drawbacks of traditional multi-stage TTS pipelines and shows how the end-to-end approach handles the alignment problem with a dedicated aligner module. Learn about the adversarial training setup, the discriminator and generator architectures, and the use of dynamic time warping to tolerate timing differences between generated and reference audio. The lecture also explains the spectrogram prediction loss and how the method reaches speech quality comparable to state-of-the-art models while operating directly on character or phoneme input sequences.

Syllabus

- Intro & Overview
- Problems with Text-to-Speech
- Adversarial Training
- End-to-End Training
- Discriminator Architecture
- Generator Architecture
- The Alignment Problem
- Aligner Architecture
- Spectrogram Prediction Loss
- Dynamic Time Warping
- Conclusion