Vision-Language Models - A New Architecture for Embedding Models
Qdrant - Vector Database & Search Engine via YouTube
Overview
Explore the cutting-edge architecture of Vision-Language Models (VLMs) and their application as embedding models in this 19-minute conference talk from Qdrant's Vector Space Day 2025. Discover how transformer architectures enable VLMs to learn from mixed text-image inputs and serve as powerful backbones for embedding models like jina-embeddings-v4. Learn about training insights for VLM-based embedding models that support both dense single-vector and late-interaction multi-vector retrieval across multiple domains, tasks, and languages. Examine the particular strengths of VLMs when processing images containing text and diagrams, UI screenshots, and illustrations. Understand critical factors affecting performance, including image resolution, retrieval objectives, and the impact of the modality gap on retrieval effectiveness. Gain insights into model evaluation methodologies and operational efficiency considerations by comparing post-training quantization with quantization-aware training, including the trade-offs between model footprint, throughput, and accuracy.
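The overview contrasts two retrieval modes supported by models like jina-embeddings-v4: dense single-vector scoring and late-interaction (ColBERT-style MaxSim) multi-vector scoring. The sketch below is illustrative only, not code from the talk; the function names, dimensions, and random stand-in embeddings are assumptions made to show how the two scoring functions differ.

```python
# Minimal sketch contrasting dense single-vector retrieval with
# late-interaction multi-vector retrieval. The random arrays stand in
# for the outputs of a VLM-based embedding model (hypothetical shapes).
import numpy as np


def dense_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Single-vector mode: one pooled embedding per item, scored by cosine similarity."""
    return float(
        np.dot(query_vec, doc_vec)
        / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
    )


def late_interaction_score(query_toks: np.ndarray, doc_toks: np.ndarray) -> float:
    """Multi-vector (MaxSim) mode: keep one embedding per query token and per
    document token/patch, match each query vector to its best document vector,
    and sum the maxima."""
    # Normalize rows so the dot products below are cosine similarities.
    q = query_toks / np.linalg.norm(query_toks, axis=1, keepdims=True)
    d = doc_toks / np.linalg.norm(doc_toks, axis=1, keepdims=True)
    sims = q @ d.T                         # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())   # best match per query token, summed


rng = np.random.default_rng(0)
dim = 128
# Dense mode: one vector per query and per document.
print(dense_score(rng.normal(size=dim), rng.normal(size=dim)))
# Late-interaction mode: a matrix of token vectors per query and per
# document (e.g., one vector per image patch for a document page).
print(late_interaction_score(rng.normal(size=(8, dim)), rng.normal(size=(64, dim))))
```

The trade-off the talk touches on follows from the shapes above: late interaction stores a vector per token or patch, which tends to improve retrieval on text-heavy images at the cost of a much larger index than a single pooled vector per document.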
Syllabus
Vision-Language Models: A New Architecture for Embedding Models | Jina AI | Michael Günther
Taught by
Qdrant - Vector Database & Search Engine