How to Benchmark Embedding Models On Your Own Data

Learn to evaluate and compare embedding models on your own datasets. The course starts with extracting text from PDF files using Python libraries, then shows how Vision Language Models (VLMs) can handle the cases where those libraries fall short. You then segment the extracted text into chunks that preserve context and use Large Language Models to generate relevant questions for each chunk. Both open-source and proprietary embedding models are used to embed the chunks and questions, including models run locally with llama.cpp in GGUF format. The models are benchmarked with several metrics and statistical tests using the ranx library, with guidance on interpreting the resulting p-values, and the vector representations are visualized to reveal cluster formation. Slides, notebooks, and scripts are available in the provided GitHub repository, along with a dataset for hands-on practice.
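
To make the benchmarking step concrete, here is a minimal sketch of the underlying idea in plain Python. It is not the course's code and does not use ranx. It assumes that question i was generated from chunk i, so chunk i is the only relevant result for that question. The toy vectors, the function names, and the choice of a paired permutation test are all illustrative assumptions.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reciprocal_ranks(question_vecs, chunk_vecs):
    """Return 1/rank of the correct chunk for every question.

    Assumption: question i was generated from chunk i.
    """
    scores = []
    for i, q in enumerate(question_vecs):
        sims = [cosine(q, c) for c in chunk_vecs]
        # Rank = 1 + number of chunks scoring strictly higher than the correct one.
        rank = 1 + sum(1 for s in sims if s > sims[i])
        scores.append(1.0 / rank)
    return scores


def paired_permutation_p_value(a, b, rounds=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-question scores."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(rounds):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)


# Hypothetical embeddings: the same three chunks, embedded questions from two models.
chunks = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
questions_a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
questions_b = [[0.9, 0.1, 0.0], [0.7, 0.6, 0.1], [0.5, 0.6, 0.4]]

rr_a = reciprocal_ranks(questions_a, chunks)
rr_b = reciprocal_ranks(questions_b, chunks)
mrr_a = sum(rr_a) / len(rr_a)
mrr_b = sum(rr_b) / len(rr_b)
p = paired_permutation_p_value(rr_a, rr_b)

print(f"MRR model A: {mrr_a:.3f}")
print(f"MRR model B: {mrr_b:.3f}")
print(f"p-value for the difference: {p:.3f}")
```

With only three questions the test cannot reach a small p-value no matter how different the models look, which is one reason a benchmark of this kind needs a reasonably large set of generated questions before the p-value is worth interpreting.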

Syllabus

About the course
Introduction
Extracting text from PDF documents
Divide text into coherent chunks
Generate question-answer pairs from text chunks
Embed text chunks and questions
Statistical tests and metrics
Expanding the dataset and adding more languages
Conclusion