Overview
Learn how to systematically optimize Large Language Model (LLM) judges for evaluating model outputs at dramatically reduced cost in this 42-minute AutoML Seminars presentation. Discover why human annotation makes LLM evaluation expensive and how LLM-based judges can rank models without human intervention by comparing outputs from different LLMs. Examine the confounding factors that make fair comparisons across research papers difficult, including models, prompts, and hyperparameters that are often changed simultaneously. Master a systematic approach to analyzing and tuning LLM judge hyperparameters with multi-objective, multi-fidelity optimization that balances accuracy against computational cost while substantially reducing the expense of the search itself. Understand how this methodology identifies judges that outperform existing benchmarks in both accuracy and cost-efficiency while relying on open-weight models for accessibility and reproducibility. Access the accompanying research paper and implementation code to apply these cost-effective evaluation strategies in your own LLM projects and research.
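To make the multi-objective, multi-fidelity idea concrete, here is a minimal Python sketch of how such a search could be structured: judge configurations are first screened on a small budget of comparisons, and only Pareto-optimal configurations (highest agreement with human rankings at lowest cost) survive to a larger budget. The hyperparameter names, model names, and the `evaluate_judge` stub are illustrative assumptions, not the presenters' actual method or code; see the accompanying paper and repository for the real implementation.

```python
import itertools
import random

# Hypothetical judge hyperparameter grid (names are illustrative only).
SEARCH_SPACE = {
    "judge_model": ["small-8b", "large-70b"],
    "prompt_style": ["pairwise", "pairwise_with_rubric"],
    "temperature": [0.0, 0.7],
    "n_votes": [1, 3],  # votes aggregated per comparison
}

def evaluate_judge(config, budget):
    """Placeholder: score one judge config on `budget` held-out comparisons.
    Returns (human_agreement, token_cost); replace with a real evaluation."""
    agreement = random.uniform(0.6, 0.9)  # stand-in for measured agreement
    cost = budget * config["n_votes"] * (70 if "70b" in config["judge_model"] else 8)
    return agreement, cost

def pareto_front(results):
    """Keep configs not dominated on (maximize agreement, minimize cost)."""
    front = []
    for cfg, (acc, cost) in results.items():
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for other, (a, c) in results.items() if other != cfg
        )
        if not dominated:
            front.append(cfg)
    return front

# Multi-fidelity loop: screen every config cheaply, then re-evaluate survivors
# at a higher budget so expensive evaluations are spent only on promising judges.
configs = [dict(zip(SEARCH_SPACE, vals)) for vals in itertools.product(*SEARCH_SPACE.values())]
results = {}
for budget in (50, 500):
    results = {tuple(c.items()): evaluate_judge(c, budget) for c in configs}
    configs = [dict(cfg) for cfg in pareto_front(results)]

for cfg, (acc, cost) in results.items():
    print(dict(cfg), f"agreement={acc:.2f}", f"cost={cost}")
```

Swapping the random stub for genuine agreement measurements against a small set of human-annotated comparisons would turn this sketch into a usable, if naive, baseline for the kind of search the talk describes.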
Syllabus
Tuning LLM Judge Design Decisions for 1/1000 of the Cost
Taught by
AutoML Seminars