Sesame AI and RVQs - The Network Architecture Behind Viral Speech Models

Explore the Sesame Conversational Speech Model in this 19-minute technical video from Neural Breakdown with AVB. The video walks through the architecture of this speech-to-speech model, which is designed for expressive speech, intelligent responses, and natural interaction. Learn how the Mimi encoder tokenizes audio with split RVQ (Residual Vector Quantization), why semantic and acoustic codes both matter for audio comprehension, and how the autoregressive Transformer backbone and audio decoder work, step by step. The video references key prior work, including Moshi, SoundStream, HuBERT, and SpeechTokenizer, to give a technical overview of the network architecture behind viral speech models. Additional resources include supplementary materials available through Patreon, related videos on transformers, and guides to fine-tuning open-source LLMs.
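
For readers new to the term, the idea behind residual vector quantization can be sketched in a few lines of Python. This is a minimal illustration of plain RVQ only, not the split variant used in Mimi and not code from the video; the function names, the random codebooks, and the sizes (`dim`, `size`, `n_stages`) are assumptions made up for this example. Real codecs learn their codebooks during training.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy RVQ: each stage quantizes what the previous stages left over."""
    residual = np.asarray(x, dtype=np.float64)
    codes = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        # Pick the codeword closest to the current residual.
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        # Pass the remaining error on to the next stage.
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the chosen codeword from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Hypothetical toy setup: random codebooks stand in for learned ones.
rng = np.random.default_rng(0)
dim, size, n_stages = 8, 16, 4
codebooks = [rng.normal(scale=1.0 / 2**s, size=(size, dim))
             for s in range(n_stages)]

x = rng.normal(size=dim)          # one "audio frame" embedding
codes = rvq_encode(x, codebooks)  # one integer token per stage
x_hat = rvq_decode(codes, codebooks)
```

Each frame ends up represented by `n_stages` integer tokens, one per codebook, which is the kind of discrete sequence an autoregressive model can be trained to predict.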