

Training Vision Language Models from Scratch Using Text-Only LLMs

Neural Breakdown with AVB via YouTube

Overview

Learn to build Vision Language Models from scratch by extending text-only Large Language Models into multimodal systems that can process both text and images. The video explores the Querying Transformer (Q-Former) architecture from the BLIP-2 paper through visual explanations and hands-on coding. You will see how Vision Transformers integrate with language models, how cross-attention works inside transformer architectures, and how to implement a Q-Former using BERT as the foundation. A step-by-step coding guide covers the ViT implementation, Q-Former development, and LoRA fine-tuning of the language model, with complete code examples and thorough explanations of the multimodal architecture.
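The pipeline the video builds follows the BLIP-2 recipe: a frozen vision encoder turns the image into patch features, a small set of learned queries distills those features through a Q-Former, and the resulting embeddings are handed to the language model as soft prompts. As a rough orientation (not the video's actual code), here is a minimal single-block Q-Former sketch in PyTorch; all module names, dimensions, and the use of torch.nn.MultiheadAttention are illustrative assumptions:

```python
# Illustrative sketch of the query -> cross-attention -> LLM-projection flow.
# Module names and sizes are assumptions, not taken from the video.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=256, vis_dim=768, llm_dim=1024):
        super().__init__()
        # Learnable query tokens, shared across all images.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(
            dim, num_heads=8, kdim=vis_dim, vdim=vis_dim, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.to_llm = nn.Linear(dim, llm_dim)  # project queries into the LLM embedding space

    def forward(self, vis_feats):                       # vis_feats: (B, num_patches, vis_dim)
        q = self.queries.expand(vis_feats.size(0), -1, -1)
        q = self.norm1(q + self.self_attn(q, q, q)[0])                    # queries talk to each other
        q = self.norm2(q + self.cross_attn(q, vis_feats, vis_feats)[0])  # queries read the image
        q = self.norm3(q + self.ffn(q))
        return self.to_llm(q)                           # (B, num_queries, llm_dim)

vis_feats = torch.randn(2, 196, 768)   # stand-in for frozen ViT patch features (14x14 patches)
soft_prompts = TinyQFormer()(vis_feats)
print(soft_prompts.shape)              # torch.Size([2, 32, 1024]) -> prepend to LLM token embeddings
```

Note that in BLIP-2 itself the Q-Former is a full BERT-style stack with cross-attention inserted every other layer; the single block above only illustrates the flow.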

Syllabus

- Intro
- Vision Transformers
- Coding ViT (sketched below)
- Q-Former models
- Coding a Q-Former from BERT
- Cross Attention in Transformers
- Coding Q-Formers
- LoRA Fine-Tuning the Language Model (sketched below)
- Summary
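
To give a flavor of the "Coding ViT" segment: the core move in a Vision Transformer is turning an image into a sequence of patch tokens that a standard transformer encoder can consume. A minimal sketch, assuming a 224x224 input and 16x16 patches (sizes and names are illustrative, not taken from the video):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        # A strided conv is the standard trick: kernel = stride = patch size.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, dim) -- one token per patch
        return x + self.pos_embed            # add learned positional information

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768]) -> feed into a standard Transformer encoder
```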
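And for the LoRA fine-tuning segment: LoRA freezes the pretrained weight matrix and trains only a low-rank update, so the language model can adapt to the new visual inputs cheaply. A minimal sketch of a LoRA-wrapped linear layer; the rank, scaling, and initialization below are common defaults, not taken from the video:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init -> no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 10, 1024))
print(out.shape)  # torch.Size([2, 10, 1024]); only A and B receive gradients
```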

Taught by

Neural Breakdown with AVB
