Target Sound Extraction with Language-oriented Audio Diffusion Transformer
Center for Language & Speech Processing (CLSP), JHU via YouTube
Overview
Explore a 31-minute research presentation from Johns Hopkins University's Center for Language & Speech Processing that introduces SoloAudio, a groundbreaking diffusion-based generative model for target sound extraction. Learn how this innovative system utilizes a skip-connected Transformer architecture operating on latent features, replacing traditional U-Net backbones. Discover the model's integration with CLAP for both audio and language-oriented sound extraction, and understand how it leverages synthetic audio from text-to-audio models during training. Examine SoloAudio's impressive capabilities in generalizing to out-of-domain data, handling novel sound events, and performing zero-shot and few-shot learning tasks.
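The core architectural idea described above — a skip-connected Transformer stack operating on latent features, conditioned on a CLAP-style embedding, in place of a U-Net backbone — can be illustrated with a toy sketch. This is a minimal illustration, not SoloAudio's actual implementation: the block structure, dimensions, and fusion weights below are hypothetical stand-ins (simple MLP blocks in place of full attention blocks), and in the real model all parameters would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_block(x, W1, W2):
    # Stand-in for a full Transformer block: two-layer MLP with a residual.
    h = np.maximum(x @ W1, 0.0)           # ReLU
    return x + h @ W2                     # residual connection

def skip_connected_stack(x, cond, n_blocks=6, d=8):
    """Toy skip-connected stack: outputs of the first half of the
    blocks are fused into mirrored blocks of the second half,
    U-Net style, but every block operates on latent features."""
    assert n_blocks % 2 == 0
    half = n_blocks // 2
    # Hypothetical random parameters; a trained model would learn these.
    Ws = [(rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d)))
          for _ in range(n_blocks)]
    fuse = [rng.normal(0, 0.1, (2 * d, d)) for _ in range(half)]

    x = x + cond                          # inject CLAP-style condition embedding
    skips = []
    for i in range(half):                 # first ("encoder") half
        x = mlp_block(x, *Ws[i])
        skips.append(x)
    for i in range(half):                 # second half with long skip connections
        s = skips.pop()                   # mirrored skip
        x = np.concatenate([x, s], axis=-1) @ fuse[i]
        x = mlp_block(x, *Ws[half + i])
    return x

latents = rng.normal(size=(4, 8))         # (time frames, latent dim)
cond = rng.normal(size=(8,))              # clip-level condition embedding
out = skip_connected_stack(latents, cond)
print(out.shape)                          # -> (4, 8): same shape as the input latents
```

The long skips pair each early block with a mirrored late block, which is the U-Net property the talk says SoloAudio preserves while swapping the convolutional backbone for Transformer blocks.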
Syllabus
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer -- Helin Wang
Taught by
Center for Language & Speech Processing (CLSP), JHU