Smaller, Stronger, and Duration-Scalable Audio Learners

In this 31-minute conference talk from MIT, postdoctoral researcher Saurabhchand Bhati presents research on state-space models (SSMs) for audio processing. Learn about the Knowledge Distilled Audio SSM (DASS), a model that outperforms Transformers on AudioSet, achieving a mean average precision (mAP) of 48.9 while reducing model size by one-third. Discover how DASS overcomes the limitations of earlier SSMs on short audio tagging tasks while maintaining strong performance on long-duration audio, as measured by the Audio Needle In A Haystack test. Bhati, whose research focuses on unsupervised spoken term discovery, representation learning, and multimodal learning, demonstrates how these models can identify sound events in hour-long recordings, whereas Transformer models fail once the input exceeds 50 seconds.