Judge Moody's - Automating Semantic Search Relevance Evaluation with LLM Judges
OpenSource Connections via YouTube
Overview
Learn how to automate semantic search relevance evaluation using large language models as judges in this 48-minute conference talk from Haystack US 2025. Discover how Moody's developed an evaluation framework for the semantic search engine that retrieves context from millions of financial research documents for its RAG-powered Research Assistant application.

Explore the challenges of traditional evaluation methods that rely on expensive, time-consuming domain-expert assessments, and understand how LLM-based automated judges can reach over 80% agreement with human evaluators through iterative prompt tuning, few-shot learning, and explicit evaluation criteria. Examine the technical implementation of a pipeline that compiles test sets, retrieves relevant document chunks, and automatically scores relevance using standard information retrieval metrics, including precision, recall, and nDCG. Understand how this approach cuts experiment iteration time from days to minutes while maintaining high correlation with expert assessments, enabling rapid algorithm development and testing.

Learn about the prompt engineering methodology, the validation process against expert judgments, and practical lessons from applying automated evaluation to specialized financial content. Gain insight into the limitations of current LLM judges on highly technical financial concepts, and discover ongoing efforts to improve domain-specific evaluation through prompt refinement and expert feedback integration.
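The talk itself does not publish Moody's prompts or code, but the LLM-judge pattern it describes can be sketched in a few lines. The example below assumes the OpenAI Python SDK and an OpenAI-style chat-completions endpoint; the model name, the 0-2 grading scale, and the prompt wording are illustrative assumptions, not Moody's actual setup. It shows the two ingredients the talk emphasizes: explicit evaluation criteria and a few-shot example baked into the judge prompt.

```python
import json
from openai import OpenAI  # assumed client; any chat-completion API works similarly

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Explicit grading criteria plus one worked example (few-shot), in the spirit
# of the approach the talk describes; this wording is hypothetical.
SYSTEM_PROMPT = """\
You are a relevance judge for a financial-research search engine.
Given a user query and a retrieved document chunk, grade the chunk:
  2 = directly answers the query,
  1 = partially relevant background,
  0 = not relevant.
Respond with JSON: {"grade": <0|1|2>, "reason": "<one sentence>"}.

Example:
Query: "What is Acme Corp's current credit outlook?"
Chunk: "We revise Acme Corp's outlook to negative on weakening liquidity."
Answer: {"grade": 2, "reason": "The chunk states the current outlook."}
"""

def judge_relevance(query: str, chunk: str, model: str = "gpt-4o") -> dict:
    """Ask the LLM judge for a graded relevance label for one (query, chunk) pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading for reproducible evaluations
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Query: {query}\nChunk: {chunk}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

In practice the judge would be run over every (query, chunk) pair in a compiled test set, with the prompt iterated against a sample of expert-labeled pairs until agreement is acceptable, which is the tuning loop the talk walks through.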
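Once the judge has produced graded labels, the downstream metrics the talk mentions are standard information-retrieval computations. The following self-contained sketch shows how graded labels for one query's ranked results feed precision, recall, nDCG, and a simple judge-versus-expert agreement rate; the example grades at the bottom are made up for illustration.

```python
import math

def dcg(grades: list[int]) -> float:
    """Discounted cumulative gain over a ranked list of graded relevance labels."""
    return sum((2**g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg_at_k(grades: list[int], k: int = 10) -> float:
    """nDCG@k: DCG of the actual ranking divided by the DCG of an ideal reordering."""
    ideal = sorted(grades, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(grades[:k]) / denom if denom > 0 else 0.0

def precision_at_k(grades: list[int], k: int = 10, threshold: int = 1) -> float:
    """Fraction of the top-k results judged at least `threshold` relevant."""
    topk = grades[:k]
    return sum(g >= threshold for g in topk) / len(topk) if topk else 0.0

def recall_at_k(grades: list[int], total_relevant: int,
                k: int = 10, threshold: int = 1) -> float:
    """Share of all relevant chunks for the query that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(g >= threshold for g in grades[:k]) / total_relevant

def agreement(llm_grades: list[int], human_grades: list[int]) -> float:
    """Fraction of chunks where the LLM judge matches the expert label."""
    pairs = list(zip(llm_grades, human_grades))
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical top-5 results for one query: LLM-judge grades vs. expert grades.
llm = [2, 1, 0, 2, 0]
human = [2, 1, 1, 2, 0]
print(f"nDCG@5 = {ndcg_at_k(llm, 5):.3f}, "
      f"P@5 = {precision_at_k(llm, 5):.2f}, "
      f"judge/human agreement = {agreement(llm, human):.0%}")
```

Validating the judge reduces to running `agreement` (or a chance-corrected statistic such as Cohen's kappa) over an expert-labeled sample, which is how a figure like the talk's 80%+ agreement would be measured.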
Syllabus
Haystack US 2025 - Gurion Marks: Judge Moody's: Automating Semantic Search Relevance Evaluation
Taught by
OpenSource Connections