Gemini 2.5 Pro and Qwen 2.5 VL for Object Detection - Benchmarking LLMs for Vision Tasks with RF100-VL

Roboflow via YouTube

Details

Provider: YouTube
Pricing: Free Video
Languages: English
Effort: 41 minutes
Sessions: Self-Paced
Level: Intermediate

Explore how vision-language models (VLMs) perform on object detection in this 41-minute video, in which Machine Learning Engineer Matvei Popov presents research findings. The video examines large pre-trained models such as Gemini 2.5 Pro, Qwen 2.5 VL, and GroundingDINO, covering the challenges of VLM generalization, the differences between pre-trained VLMs and task-specific vision models, and the potential benefits of using VLMs for detection. It then introduces RF100-VL, a benchmark for evaluating VLMs on object detection, and explains the evaluation methodology, the prompting strategies, and the comparative results across models. It closes with how pre-training data can be leveraged for zero-shot detection and what this implies for computer vision applications.

Syllabus

00:00 Introduction: Do VLMs Struggle to Generalize on Object Detection Tasks?
03:28 Understanding Pre-Trained VLMs vs. Task-Specific Vision Models
04:54 Why Even Use VLMs for Object Detection?
09:48 Can We Leverage VLMs' Pre-Training Data for Zero-Shot Detections?
12:18 Introducing RF100-VL: Object Detection Benchmark for VLMs
17:52 How to Evaluate Object Detection Capabilities in VLMs
21:46 Example: Comparing Evaluation Performance
25:34 Prompting Strategies for Object Detection Tests
30:10 Results! Comparing VLMs Object Detection Scores
37:43 Conclusion, Takeaways, and Looking Forward
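
As background for the evaluation chapters in the syllabus, here is a minimal Python sketch of intersection over union (IoU), a score commonly used to compare a predicted bounding box against a ground-truth box. This listing does not say which metric the video or RF100-VL uses, so treat the choice of IoU as an assumption. The function name `iou` and the `(x_min, y_min, x_max, y_max)` box format are also illustrative choices, not taken from the video.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples. This format is an
    assumption made for illustration, not something the video specifies.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (5, 0, 15, 10)      # hypothetical model output
    ground_truth = (0, 0, 10, 10)   # hypothetical annotation
    print(f"IoU: {iou(predicted, ground_truth):.3f}")
```

A prediction is often counted as correct when its IoU with a ground-truth box of the same class exceeds a chosen threshold; the threshold is a benchmark design decision that this listing does not state.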