Target Sound Extraction with Language-oriented Audio Diffusion Transformer
Center for Language & Speech Processing (CLSP), JHU via YouTube
Overview
Explore a 31-minute research presentation from Johns Hopkins University's Center for Language & Speech Processing that introduces SoloAudio, a groundbreaking diffusion-based generative model for target sound extraction. Learn how this innovative system utilizes a skip-connected Transformer architecture operating on latent features, replacing traditional U-Net backbones. Discover the model's integration with CLAP for both audio and language-oriented sound extraction, and understand how it leverages synthetic audio from text-to-audio models during training. Examine SoloAudio's impressive capabilities in generalizing to out-of-domain data, handling novel sound events, and performing zero-shot and few-shot learning tasks.
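The core architectural idea described above — a skip-connected Transformer stack operating on latent features, conditioned on a CLAP-style embedding, in place of a U-Net backbone — can be illustrated with a toy sketch. This is a minimal illustration, not SoloAudio's actual implementation: the block structure, dimensions, and fusion weights below are hypothetical stand-ins (simple MLP blocks in place of full attention blocks), and in the real model all parameters would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_block(x, W1, W2):
    # Stand-in for a full Transformer block: two-layer MLP with a residual.
    h = np.maximum(x @ W1, 0.0)           # ReLU
    return x + h @ W2                     # residual connection

def skip_connected_stack(x, cond, n_blocks=6, d=8):
    """Toy skip-connected stack: outputs of the first half of the
    blocks are fused into mirrored blocks of the second half,
    U-Net style, but every block operates on latent features."""
    assert n_blocks % 2 == 0
    half = n_blocks // 2
    # Hypothetical random parameters; a trained model would learn these.
    Ws = [(rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d)))
          for _ in range(n_blocks)]
    fuse = [rng.normal(0, 0.1, (2 * d, d)) for _ in range(half)]

    x = x + cond                          # inject CLAP-style condition embedding
    skips = []
    for i in range(half):                 # first ("encoder") half
        x = mlp_block(x, *Ws[i])
        skips.append(x)
    for i in range(half):                 # second half with long skip connections
        s = skips.pop()                   # mirrored skip
        x = np.concatenate([x, s], axis=-1) @ fuse[i]
        x = mlp_block(x, *Ws[half + i])
    return x

latents = rng.normal(size=(4, 8))         # (time frames, latent dim)
cond = rng.normal(size=(8,))              # clip-level condition embedding
out = skip_connected_stack(latents, cond)
print(out.shape)                          # -> (4, 8): same shape as the input latents
```

The long skips pair each early block with a mirrored late block, which is the U-Net property the talk says SoloAudio preserves while swapping the convolutional backbone for Transformer blocks.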
Syllabus
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer -- Helin Wang
Taught by
Center for Language & Speech Processing (CLSP), JHU