Formant Synthesis, Concatenative Synthesis and Statistical Methods for Text-to-Speech

Explore the foundational text-to-speech synthesis methods that dominated the field before neural networks began to reshape speech generation around 2016. Learn how formant synthesis models the resonances of the human vocal tract to generate speech sounds, and where it falls short of natural-sounding voices. Study concatenative synthesis, including diphone concatenation and unit selection, which stitch together recorded speech units to form utterances. Examine statistical parametric synthesis with Hidden Markov Models (HMMs) and learn why these approaches, though groundbreaking for their time, often produced robotic or over-smoothed output. Compare the pros and cons of each traditional method and see how these techniques laid the groundwork for modern neural text-to-speech systems.

Syllabus

Intro
Formant synthesis
Formant: Pros and cons
Concatenative synthesis
Diphone concatenation
Unit selection
Concat: Pros and cons
Statistical parametric synthesis (HMM)
HMM-based TTS: Pros and cons
Comparing traditional TTS