Training Vision Language Models from Scratch Using Text-Only LLMs
Neural Breakdown with AVB via YouTube
Overview
Learn to build Vision Language Models from scratch by transforming text-only Large Language Models into multimodal systems capable of processing both text and images. Explore the Querying Transformer (Q-Former) architecture from the BLIP-2 paper through visual explanations and hands-on coding. Master Vision Transformers and their integration with language models, understand cross-attention mechanisms in transformer architectures, and implement a Q-Former using BERT as a foundation. Follow a comprehensive step-by-step coding guide covering ViT implementation, Q-Former development, and LoRA fine-tuning for language models, leaving you with practical skills to train your own Vision Language Models, complete code examples, and thorough explanations of multimodal AI architecture.
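The Q-Former's core idea is a fixed set of learned query tokens that cross-attend to a ViT's patch features, distilling an image into a small number of vectors an LLM can consume. The following is a minimal NumPy sketch of that cross-attention step, not the course's actual code; the function name, dimensions, and single-head formulation are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, image_feats, d_k):
    # queries: (num_queries, d) learned query tokens
    # image_feats: (num_patches, d) ViT patch embeddings (keys and values)
    scores = queries @ image_feats.T / np.sqrt(d_k)  # (num_queries, num_patches)
    weights = softmax(scores, axis=-1)               # each query attends over patches
    return weights @ image_feats                     # (num_queries, d) image summary

rng = np.random.default_rng(0)
d = 32
learned_queries = rng.normal(size=(8, d))   # e.g. 8 learnable query tokens
patch_features = rng.normal(size=(49, d))   # e.g. a 7x7 ViT patch grid, flattened
out = cross_attention(learned_queries, patch_features, d)
print(out.shape)  # (8, 32)
```

Regardless of how many patches the image produces, the output is always `num_queries` vectors, which is what makes the queries a compact, fixed-size interface between the vision encoder and the language model.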
Syllabus
- Intro
- Vision Transformers
- Coding ViT
- Q-Former models
- Coding Q-Former from a BERT
- Cross Attention in Transformers
- Coding Q-Formers
- LoRA finetune Language Model
- Summary
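The LoRA fine-tuning step in the syllabus keeps the pretrained weight matrix frozen and learns only a low-rank update, W' = W + (alpha/r)·B·A. Here is a minimal NumPy sketch of that idea, under assumed dimensions and the standard zero-initialization of B; it is illustrative, not the video's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 64, 64, 4, 8   # assumed layer sizes, rank, and scaling

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # zero-init so the update starts as a no-op

def lora_forward(x):
    # Base (frozen) path plus the scaled low-rank correction
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B zero-initialized, the adapted layer matches the frozen layer exactly
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B (here 4x64 and 64x4) receive gradients during fine-tuning, which is why LoRA lets you adapt a large language model while training a tiny fraction of its parameters.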
Taught by
Neural Breakdown with AVB