Training Vision Language Models from Scratch Using Text-Only LLMs
Neural Breakdown with AVB via YouTube
Overview
Learn to build Vision Language Models from scratch by transforming text-only Large Language Models into multimodal systems capable of processing both text and images. Explore the Querying Transformer (Q-Former) architecture from the BLIP-2 paper through visual explanations and hands-on coding. Master Vision Transformers and their integration with language models, understand cross-attention mechanisms in transformer architectures, and implement a Q-Former using BERT as a foundation. Follow a comprehensive step-by-step coding guide covering ViT implementation, Q-Former development, and LoRA fine-tuning for language models, leaving you with practical skills to train your own Vision Language Models, complete code examples, and thorough explanations of multimodal AI architecture.
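The Q-Former's core idea is a fixed set of learned query tokens that cross-attend to a ViT's patch features, distilling an image into a small number of vectors an LLM can consume. The following is a minimal NumPy sketch of that cross-attention step, not the course's actual code; the function name, dimensions, and single-head formulation are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, image_feats, d_k):
    # queries: (num_queries, d) learned query tokens
    # image_feats: (num_patches, d) ViT patch embeddings (keys and values)
    scores = queries @ image_feats.T / np.sqrt(d_k)  # (num_queries, num_patches)
    weights = softmax(scores, axis=-1)               # each query attends over patches
    return weights @ image_feats                     # (num_queries, d) image summary

rng = np.random.default_rng(0)
d = 32
learned_queries = rng.normal(size=(8, d))   # e.g. 8 learnable query tokens
patch_features = rng.normal(size=(49, d))   # e.g. a 7x7 ViT patch grid, flattened
out = cross_attention(learned_queries, patch_features, d)
print(out.shape)  # (8, 32)
```

Regardless of how many patches the image produces, the output is always `num_queries` vectors, which is what makes the queries a compact, fixed-size interface between the vision encoder and the language model.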
Syllabus
- Intro
- Vision Transformers
- Coding ViT
- Q-Former models
- Coding Q-Former from a BERT
- Cross Attention in Transformers
- Coding Q-Formers
- LoRA finetune Language Model
- Summary
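The LoRA fine-tuning step in the syllabus keeps the pretrained weight matrix frozen and learns only a low-rank update, W' = W + (alpha/r)·B·A. Here is a minimal NumPy sketch of that idea, under assumed dimensions and the standard zero-initialization of B; it is illustrative, not the video's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 64, 64, 4, 8   # assumed layer sizes, rank, and scaling

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # zero-init so the update starts as a no-op

def lora_forward(x):
    # Base (frozen) path plus the scaled low-rank correction
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B zero-initialized, the adapted layer matches the frozen layer exactly
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B (here 4x64 and 64x4) receive gradients during fine-tuning, which is why LoRA lets you adapt a large language model while training a tiny fraction of its parameters.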
Taught by
Neural Breakdown with AVB