Introduction to Multimodal Large Language Models I - Day 10 Morning

Center for Language & Speech Processing (CLSP), JHU via YouTube

Overview

Explore the fundamentals of multimodal large language models in this lecture from JSALT 2025. The session covers core LLM concepts, including next-token prediction, transformer and conformer architectures, and the differences between training and inference. It compares supervised fine-tuning (SFT) with LoRA fine-tuning, explains how CLIP handles images and CLAP handles audio, and examines the trade-offs between fine-tuning and full training, along with contrastive loss functions and cross-attention approaches.

The lecture then turns to training audio language models: audio encoders such as CLAP, AST, and Whisper; large audio LLMs including Audio Flamingo 2 and Audio Flamingo 3; dataset creation and synthetic augmentation for audio; building Audio Question Answering (AQA) benchmarks; and future directions in audio generation and audio-to-audio processing.

The presenters are researchers in multimodal AI from the University of Maryland, Brno University of Technology, and Universidad Autónoma de Madrid, speaking as part of the workshop research group focused on advancing expert-level reasoning and understanding in large audio language models.

Syllabus

Day 10 morning - JSALT 2025 - Introduction to Multimodal Large Language Models I.

Taught by

Center for Language & Speech Processing (CLSP), JHU
