Streaming Speech to Text Models - Kyutai vs Whisper

Learn to implement and compare streaming speech-to-text (STT) models by testing Kyutai's real-time transcription against OpenAI's Whisper in this technical tutorial. Set up and run Kyutai STT on a Mac, run streaming transcription in a Jupyter notebook, and use word-level timestamps for precise audio analysis. Try text-assisted and audio-assisted transcription, then build a fast streaming STT server in Rust for production environments. Compare the Whisper and Kyutai architectures, cover the theory behind timestamping in speech recognition, and see how Kyutai is trained on Whisper-timestamped data to enable streaming transcription. The demos cover both English and French, and the streaming approach is evaluated against batch transcription models such as Whisper and Voxtral.

Syllabus

0:00 Streaming Speech to Text Demo with Kyutai STT
0:42 Demo in French
1:05 Video Overview
2:42 Resources & Repo
3:15 Running Kyutai STT on your Mac
5:15 Run streaming STT in a notebook
5:58 Word timestamping
8:52 Text and Audio Assisted Transcription
11:46 Fast streaming STT server with Rust
15:27 Streaming vs Whisper STT vs Voxtral
19:53 Theory of Timestamping
22:55 Whisper vs Kyutai STT architectures
24:34 How Kyutai is trained with Whisper timestamped data
25:50 Wrap up