Building Next-Gen Speech Synthesis for Bookmate Audiobook Service

In this 22-minute conference talk from DSC EUROPE 24, Vladimir Platonov covers the development and evolution of text-to-speech (TTS) technology for Bookmate's audiobook service. He describes how the team built a high-quality TTS system from just 90 hours of training data and deployed it across more than 100,000 books. The talk addresses key aspects of audiobook production, including voice selection, dataset construction, and strategies for meeting user expectations for synthetic speech. Vladimir then traces the path to next-generation TTS: using tens of thousands of hours of data, managing large-scale datasets, implementing advanced neural network architectures, and assessing quality through crowdsourcing and user feedback. The presentation offers insights for anyone interested in speech synthesis technology and its practical applications in digital content delivery.