

Gemini 2.5 Pro and Qwen 2.5 VL for Object Detection - Benchmarking LLMs for Vision Tasks with RF100-VL

Roboflow via YouTube

Overview

This 41-minute video presents Machine Learning Engineer Matvei Popov's research on how vision-language models (VLMs) perform on object detection tasks. It covers the object detection capabilities of large pre-trained models such as Gemini 2.5 Pro, Qwen 2.5 VL, and GroundingDINO, the challenges of VLM generalization, the differences between pre-trained VLMs and task-specific vision models, and the potential benefits of using VLMs for detection. The video introduces RF100-VL, a benchmark for evaluating VLMs on object detection, and explains the evaluation methodologies, the prompting strategies, and the comparative performance results across models. It also discusses leveraging pre-training data for zero-shot detection and the future implications for computer vision applications.

Syllabus

00:00 Introduction: Do VLMs Struggle to Generalize on Object Detection Tasks?
03:28 Understanding Pre-Trained VLMs vs. Task-Specific Vision Models
04:54 Why Even Use VLMs for Object Detection?
09:48 Can We Leverage VLMs' Pre-Training Data for Zero-Shot Detections?
12:18 Introducing RF100-VL: Object Detection Benchmark for VLMs
17:52 How to Evaluate Object Detection Capabilities in VLMs
21:46 Example: Comparing Evaluation Performance
25:34 Prompting Strategies for Object Detection Tests
30:10 Results! Comparing VLMs Object Detection Scores
37:43 Conclusion, Takeaways, and Looking Forward

Taught by

Roboflow
