Explore advanced computer vision concepts through this comprehensive university lecture series covering cutting-edge visual-language models and multimodal AI systems. Delve into transformer architectures and their applications in computer vision, starting with foundational concepts before progressing to sophisticated models like CLIP for connecting vision and language. Examine state-of-the-art visual-language models including CoCA, PALI, FLAMINGO, FLAVA, PAINTER, and BLIP-2, understanding how they bridge the gap between visual and textual understanding. Investigate multimodal binding techniques through Image-Bind and Language-Bind systems, and discover how large language models are integrated with visual processing in LLaVA architectures. Study video understanding capabilities through Video ChatGPT and PG-Video LLaVA implementations, learning how temporal information enhances multimodal reasoning. Master fine-grained language-image pre-training methods with FILIP and hierarchy-aware attention mechanisms in HiCLIP. Analyze bootstrapping techniques for unified vision-language understanding and generation through BLIP and BLIP-2 architectures, the latter of which leverages frozen image encoders and large language models. Explore joint learning approaches for multimodal tasks using the MaMMUT architecture and neural script knowledge learned from vision, language, and sound in MERLOT RESERVE. Understand referential dialogue capabilities in multimodal systems with Shikra and video-language alignment before projection in Video-LLaVA. Examine pixel-level grounding in large video-language models with PG-Video-LLaVA, along with methods for evaluating object hallucination in large vision-language models. Investigate the visual shortcomings of current multimodal large language models. Study scaling approaches for autoregressive multi-modal models through CM3Leon and open-vocabulary object detection with OWLv2. Learn about grounded pre-training for open-set object detection with Grounding DINO and caption enrichment using large language models in FuseCap.

Syllabus

Lecture 1 - Introduction
Lecture 2 - Transformers Introduction
Lecture 3 - CLIP
Lecture 4 - Visual-Language Models Introduction Part-I: CoCA, PALI
Lecture 5 - Visual-Language Models Introduction Part-II: FLAMINGO, FLAVA, PAINTER, BLIP-2
Lecture 6 - Visual-Language Models Introduction Part-III: Image-Bind, Language-Bind, LLaVA
Lecture 7 - Visual-Language Models Introduction Part-IV: Video ChatGPT, PG-Video LLaVA
Lecture 8 - FILIP: Fine-grained Interactive Language-Image Pre-Training
Lecture 9 - HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
Lecture 10 - BLIP: Bootstrapping Language-Image Pretraining for Unified VL Understanding and Generation
Lecture 11 - BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs
Lecture 12 - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Lecture 13 - MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
Lecture 14 - Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Lecture 15 - Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Lecture 16 - PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Lecture 17 - Evaluating Object Hallucination in Large Vision-Language Models
Lecture 18 - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Lecture 19 - CM3Leon: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
Lecture 20 - OWLv2: Scaling Open-Vocabulary Object Detection
Lecture 21 - Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Lecture 22 - FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions