Sesame AI and RVQs - The Network Architecture Behind Viral Speech Models
Neural Breakdown with AVB via YouTube
Overview
Explore the Sesame Conversational Speech Model in this 19-minute technical video from Neural Breakdown with AVB. The video walks through the architecture of this speech-to-speech model, which is designed for expressive speech, intelligent responses, and natural-sounding interaction. Learn how the Mimi Encoder tokenizes audio with split RVQ (Residual Vector Quantization), why semantic and acoustic codes both matter for audio comprehension, and how the Autoregressive Transformer Backbone and Audio Decoder work, step by step. The video references key research papers, including Moshi, SoundStream, HuBERT, and SpeechTokenizer, giving a technical overview of the network architecture behind viral speech models. Additional resources include supplementary materials available through Patreon, related videos on transformers, and guides to fine-tuning open-source LLMs.
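For readers unfamiliar with the term, the following is a minimal sketch of plain residual vector quantization, not the Mimi Encoder's actual split-RVQ implementation. Each stage picks the nearest codeword to whatever the previous stages left unexplained, so one vector becomes a short list of integer codes. The codebooks, the function names, and the 2-D toy vector are all assumptions made for illustration; real codebooks are learned and much larger.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks, codes):
    """Reconstruct the vector by summing the chosen codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical hand-picked codebooks: a coarse stage followed by a fine stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]

codes = rvq_encode(CODEBOOKS, [1.2, 1.0])
approx = rvq_decode(CODEBOOKS, codes)
print(codes, approx)
```

Adding more stages shrinks the reconstruction error, which is why the number of codebooks acts as a quality and bitrate knob in codecs of this family.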
Syllabus
Sesame AI and RVQs - the network architecture behind VIRAL speech models
Taught by
Neural Breakdown with AVB