CAP6412 Advanced Computer Vision - Spring 2024

CAP6412 Advanced Computer Vision - Spring 2024

UCF CRCV via YouTube Direct link

Lecture 22 - FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

22 of 22

22 of 22

Lecture 22 - FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

Class Central Classrooms beta

YouTube videos curated by Class Central.

Classroom Contents

CAP6412 Advanced Computer Vision - Spring 2024

Automatically move to the next video in the Classroom when playback concludes

  1. 1 Lecture 1 - Introduction
  2. 2 Lecture 2 - Transformers Introduction
  3. 3 Lecture 3 - CLIP
  4. 4 Lecture 4 - Visual-Language Models Introduction Part-I: CoCA, PALI
  5. 5 Lecture 5 - Visual-Language Models Introduction Part-II: FLAMINGO, FLAVA, PAINTER, BLIP-2
  6. 6 Lecture 6 - Visual-Language Models Introduction Part-III: Image-Bind, Language-Bind, LLaVA
  7. 7 Lecture 7 - Visual-Language Models Introduction Part-IV: Video ChatGPT, PG-Video LLaVA
  8. 8 Lecture 8 - FILIP: Fine-grained Interactive Language-Image Pre-Training
  9. 9 Lecture 9 - HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
  10. 10 Lecture 10-BLIP:Bootstrapping Language-Image Pretraining for Unified VL Understanding and Generation
  11. 11 Lecture 11 - BLIP-2 : Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs
  12. 12 Lecture 12 - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
  13. 13 Lecture 13 - MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
  14. 14 Lecture 14 - Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
  15. 15 Lecture 15 - Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  16. 16 Lecture 16 - PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
  17. 17 Lecture 17 - Evaluating Object Hallucination in Large Vision-Language Models
  18. 18 Lecture 18 - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
  19. 19 Lecture 19 - CM3Leon: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
  20. 20 Lecture 20 - OWLv2: Scaling Open-Vocabulary Object Detection
  21. 21 Lecture 21 - Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
  22. 22 Lecture 22 - FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

Never Stop Learning.

Get personalized course recommendations, track subjects and courses with reminders, and more.

Someone learning on their laptop while sitting on the floor.