Target Sound Extraction with Language-oriented Audio Diffusion Transformer
Center for Language & Speech Processing (CLSP), JHU via YouTube
Overview
Explore a 31-minute research presentation from Johns Hopkins University's Center for Language & Speech Processing introducing SoloAudio, a diffusion-based generative model for target sound extraction. Learn how the system replaces the traditional U-Net backbone with a skip-connected Transformer operating on latent features. Discover how the model uses CLAP to support both audio-oriented and language-oriented sound extraction, and how it leverages synthetic audio generated by text-to-audio models during training. Examine SoloAudio's ability to generalize to out-of-domain data, handle novel sound events, and perform zero-shot and few-shot learning tasks.
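The talk's exact architecture is not reproduced here, but the "skip-connected Transformer" idea it describes can be sketched in a few lines: the first half of a block stack saves its activations, and the second half fuses them back in U-Net style. The block below is a toy stand-in (a simple nonlinearity in place of real attention and MLP layers), intended only to illustrate the skip wiring, not SoloAudio's implementation.

```python
import numpy as np

def transformer_block(x, scale):
    # Toy stand-in for a real Transformer block (attention + feed-forward).
    return np.tanh(x * scale)

def skip_connected_stack(x, depth=4):
    """U-Net-style skip connections between Transformer blocks on latent features.

    The first depth//2 blocks push their outputs onto a stack; the remaining
    blocks pop the matching activation and fuse it back in (by addition here,
    for simplicity) before processing.
    """
    skips = []
    # "Encoder" half: store activations for later skip connections.
    for i in range(depth // 2):
        x = transformer_block(x, scale=1.0 + i)
        skips.append(x)
    # "Decoder" half: fuse each stored activation back in, last-saved first.
    for i in range(depth // 2):
        x = transformer_block(x + skips.pop(), scale=1.0 + i)
    return x

latent = np.random.randn(8, 16)  # toy latent features (time frames x channels)
out = skip_connected_stack(latent)
print(out.shape)  # (8, 16): the stack preserves the latent shape
```

The key design point, as the overview notes, is that these skips let a plain Transformer stack recover the multi-scale feature reuse that U-Net backbones provide in diffusion models.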
Syllabus
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer -- Helin Wang
Taught by
Center for Language & Speech Processing (CLSP), JHU