Vision-Language Models - A New Architecture for Embedding Models
Qdrant - Vector Database & Search Engine via YouTube
Overview
This 19-minute conference talk from Qdrant's Vector Space Day 2025 covers the architecture of Vision-Language Models (VLMs) and their use as embedding models. It explains how transformer architectures let VLMs learn from mixed text-image inputs and serve as backbones for embedding models such as jina-embeddings-v4. The talk shares training insights for VLM-based embedding models that support both dense single-vector and late-interaction multi-vector retrieval across multiple domains, tasks, and languages. It examines the particular strengths of VLMs when processing images containing text and diagrams, UI screenshots, and illustrations, and it discusses factors that affect performance: image resolution, retrieval objectives, and the impact of the modality gap on retrieval effectiveness. It also covers model evaluation methodology and operational efficiency, comparing post-training quantization with quantization-aware training and the trade-offs between model footprint, throughput, and accuracy.
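To make the distinction between dense single-vector and late-interaction multi-vector retrieval concrete, here is a minimal sketch in plain Python. It is not taken from the talk or from jina-embeddings-v4. The toy token vectors, the helper names, and the use of mean pooling to obtain a single vector are all assumptions made for illustration. The late-interaction score uses the commonly cited MaxSim formulation: for each query token, take its best cosine match among the document tokens, then sum those maxima.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(token_vectors):
    """Collapse token vectors into one vector by averaging.
    Assumption: real models may pool differently."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def dense_score(query_tokens, doc_tokens):
    """Dense retrieval: compare one pooled vector per side."""
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))

def late_interaction_score(query_tokens, doc_tokens):
    """Late interaction (MaxSim): sum, over query tokens, of the
    best cosine match among the document tokens."""
    return sum(
        max(cosine(q, d) for d in doc_tokens)
        for q in query_tokens
    )

# Hypothetical 2-dimensional token embeddings, invented for this example.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print("dense:", dense_score(query_tokens, doc_tokens))
print("late interaction:", late_interaction_score(query_tokens, doc_tokens))
```

The dense path stores and compares one vector per item, which keeps the index small. The late-interaction path keeps one vector per token (or image patch), which costs more storage and compute but preserves fine-grained matches.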
Syllabus
Vision-Language Models: A New Architecture for Embedding Models | Jina AI | Michael Günther
Taught by
Qdrant - Vector Database & Search Engine