Overview
Master practical LLM evaluation by benchmarking models on QA and text generation, using metrics like fuzzy matching, ROUGE, and semantic similarity. Learn to analyze logprobs, perplexity, and model behavior for robust, real-world NLP assessment.
Syllabus
- Course 1: Benchmarking LLMs with QA
- Course 2: Benchmarking LLMs on Text Generation
- Course 3: Scoring LLM Outputs with Logprobs and Perplexity
- Course 4: Behavioral Benchmarking of LLMs
Courses
- Learn how to benchmark large language models using multiple-choice QA, summarization, and scoring techniques like fuzzy matching, ROUGE, and semantic similarity. Compare GPT models across tasks and dive into internal evaluation with log probabilities and perplexity.
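As a taste of the scoring techniques covered here, fuzzy matching can be sketched with Python's standard-library `difflib` (one common choice; the course may use a different string-similarity library):

```python
from difflib import SequenceMatcher

def fuzzy_match(prediction: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1] between a model answer and the gold answer."""
    # Normalize case and surrounding whitespace before comparing.
    return SequenceMatcher(None, prediction.lower().strip(), reference.lower().strip()).ratio()

# Exact match (after normalization) scores 1.0; partial overlap scores in between.
score = fuzzy_match("Paris, France", "paris")
```

A benchmark then typically thresholds this ratio (e.g. counting scores above 0.8 as correct) rather than requiring exact string equality.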
- This course explores benchmarking for open-ended generation tasks like summarization. You'll experiment with different prompting styles, compare models like GPT-3.5 and GPT-4, and evaluate results using both fuzzy string similarity and semantic similarity via embeddings.
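Semantic similarity via embeddings boils down to cosine similarity between embedding vectors. A minimal sketch, assuming the vectors come from an embedding model (e.g. an OpenAI embeddings endpoint; the toy vectors below stand in for real embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# In practice, a and b would be embeddings of a generated summary and a reference summary.
sim = cosine_similarity([0.1, 0.9, 0.2], [0.2, 0.8, 0.1])
```

Unlike fuzzy string matching, this scores paraphrases highly even when they share few surface words.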
- In this course, you'll explore how to evaluate the fluency and likelihood of LLM outputs using internal scoring signals like log probabilities and perplexity. You'll work with OpenAI's completion models to analyze how models "think" under the hood. This course builds naturally on the first two by focusing on model-internal evaluation instead of external references.
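The core relationship between these two signals is simple: perplexity is the exponential of the negative mean token log-probability. A sketch, assuming `token_logprobs` is the per-token logprob list returned alongside a completion:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability). Lower means the model found the text more likely."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Logprobs of 0.0 mean probability 1.0 for every token, giving the minimum perplexity of 1.0.
ppl = perplexity([-0.2, -1.5, -0.7])
```

Comparing perplexity across candidate outputs gives a reference-free fluency signal, in contrast to the reference-based metrics of the first two courses.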
- In this course, you'll experiment with deeper aspects of LLM evaluation: token usage efficiency, temperature sensitivity, model output consistency, and detecting hallucinations. Through lightweight API experiments, you'll develop intuition for how models behave beyond accuracy scores.
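One lightweight consistency experiment is to sample the same prompt several times and measure how often the answers agree. A minimal sketch (the normalization and majority-vote scoring here are illustrative assumptions, not the course's exact protocol):

```python
from collections import Counter

def consistency(outputs: list[str]) -> float:
    """Fraction of sampled outputs that match the most common (normalized) answer."""
    counts = Counter(o.strip().lower() for o in outputs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(outputs)

# e.g. five samples of the same prompt at temperature 1.0:
rate = consistency(["Paris", "paris", "Paris", "Lyon", "Paris"])
```

A consistency rate that drops sharply as temperature rises is one practical signal of how sensitive a model's answers are to sampling noise.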