Vision-Language Models - A New Architecture for Embedding Models
Qdrant - Vector Database & Search Engine via YouTube
Overview
Explore the cutting-edge architecture of Vision-Language Models (VLMs) and their application as embedding models in this 19-minute conference talk from Qdrant's Vector Space Day 2025. Discover how transformer architectures enable VLMs to learn from mixed text-image inputs and serve as powerful backbones for embedding models like jina-embeddings-v4. Learn about training insights for VLM-based embedding models that support both dense single-vector and late-interaction multi-vector retrieval across multiple domains, tasks, and languages. Examine the particular strengths of VLMs when processing images containing text and diagrams, UI screenshots, and illustrations. Understand critical factors affecting performance, including image resolution, retrieval objectives, and the impact of the modality gap on retrieval effectiveness. Gain insights into model evaluation methodologies and operational efficiency considerations by comparing post-training quantization with quantization-aware training, including the trade-offs between model footprint, throughput, and accuracy.
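The overview contrasts two retrieval modes supported by models like jina-embeddings-v4: dense single-vector scoring and late-interaction (ColBERT-style MaxSim) multi-vector scoring. The sketch below is illustrative only, not code from the talk; the function names, dimensions, and random stand-in embeddings are assumptions made to show how the two scoring functions differ.

```python
# Minimal sketch contrasting dense single-vector retrieval with
# late-interaction multi-vector retrieval. The random arrays stand in
# for the outputs of a VLM-based embedding model (hypothetical shapes).
import numpy as np


def dense_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Single-vector mode: one pooled embedding per item, scored by cosine similarity."""
    return float(
        np.dot(query_vec, doc_vec)
        / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
    )


def late_interaction_score(query_toks: np.ndarray, doc_toks: np.ndarray) -> float:
    """Multi-vector (MaxSim) mode: keep one embedding per query token and per
    document token/patch, match each query vector to its best document vector,
    and sum the maxima."""
    # Normalize rows so the dot products below are cosine similarities.
    q = query_toks / np.linalg.norm(query_toks, axis=1, keepdims=True)
    d = doc_toks / np.linalg.norm(doc_toks, axis=1, keepdims=True)
    sims = q @ d.T                         # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())   # best match per query token, summed


rng = np.random.default_rng(0)
dim = 128
# Dense mode: one vector per query and per document.
print(dense_score(rng.normal(size=dim), rng.normal(size=dim)))
# Late-interaction mode: a matrix of token vectors per query and per
# document (e.g., one vector per image patch for a document page).
print(late_interaction_score(rng.normal(size=(8, dim)), rng.normal(size=(64, dim))))
```

The trade-off the talk touches on follows from the shapes above: late interaction stores a vector per token or patch, which tends to improve retrieval on text-heavy images at the cost of a much larger index than a single pooled vector per document.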
Syllabus
Vision-Language Models: A New Architecture for Embedding Models | Jina AI | Michael Günther
Taught by
Qdrant - Vector Database & Search Engine