Overview
Master practical LLM evaluation by benchmarking models on QA and text generation, using metrics like fuzzy matching, ROUGE, and semantic similarity. Learn to analyze logprobs, perplexity, and model behavior for robust, real-world NLP assessment.
Syllabus
- Course 1: Benchmarking LLMs with QA
- Course 2: Benchmarking LLMs on Text Generation
- Course 3: Scoring LLM Outputs with Logprobs and Perplexity
- Course 4: Behavioral Benchmarking of LLMs
Courses
- Learn how to benchmark large language models using multiple-choice QA, summarization, and scoring techniques like fuzzy matching, ROUGE, and semantic similarity. Compare GPT models across tasks and dive into internal evaluation with log probabilities and perplexity.
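As a taste of the scoring techniques covered here, fuzzy matching can be sketched with Python's standard-library `difflib` (one common choice; the course may use a different string-similarity library):

```python
from difflib import SequenceMatcher

def fuzzy_match(prediction: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1] between a model answer and the gold answer."""
    # Normalize case and surrounding whitespace before comparing.
    return SequenceMatcher(None, prediction.lower().strip(), reference.lower().strip()).ratio()

# Exact match (after normalization) scores 1.0; partial overlap scores in between.
score = fuzzy_match("Paris, France", "paris")
```

A benchmark then typically thresholds this ratio (e.g. counting scores above 0.8 as correct) rather than requiring exact string equality.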
- This course explores benchmarking for open-ended generation tasks like summarization. You'll experiment with different prompting styles, compare models like GPT-3.5 and GPT-4, and evaluate results using both fuzzy string similarity and semantic similarity via embeddings.
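Semantic similarity via embeddings boils down to cosine similarity between embedding vectors. A minimal sketch, assuming the vectors come from an embedding model (e.g. an OpenAI embeddings endpoint; the toy vectors below stand in for real embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# In practice, a and b would be embeddings of a generated summary and a reference summary.
sim = cosine_similarity([0.1, 0.9, 0.2], [0.2, 0.8, 0.1])
```

Unlike fuzzy string matching, this scores paraphrases highly even when they share few surface words.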
- In this course, you'll explore how to evaluate the fluency and likelihood of LLM outputs using internal scoring signals like log probabilities and perplexity. You'll work with OpenAI's completion models to analyze how models "think" under the hood. This course builds naturally on the first two by focusing on model-internal evaluation instead of external references.
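The core relationship between these two signals is simple: perplexity is the exponential of the negative mean token log-probability. A sketch, assuming `token_logprobs` is the per-token logprob list returned alongside a completion:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability). Lower means the model found the text more likely."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Logprobs of 0.0 mean probability 1.0 for every token, giving the minimum perplexity of 1.0.
ppl = perplexity([-0.2, -1.5, -0.7])
```

Comparing perplexity across candidate outputs gives a reference-free fluency signal, in contrast to the reference-based metrics of the first two courses.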
- In this course, you'll experiment with deeper aspects of LLM evaluation: token usage efficiency, temperature sensitivity, model output consistency, and detecting hallucinations. Through lightweight API experiments, you'll develop intuition for how models behave beyond accuracy scores.
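One lightweight consistency experiment is to sample the same prompt several times and measure how often the answers agree. A minimal sketch (the normalization and majority-vote scoring here are illustrative assumptions, not the course's exact protocol):

```python
from collections import Counter

def consistency(outputs: list[str]) -> float:
    """Fraction of sampled outputs that match the most common (normalized) answer."""
    counts = Counter(o.strip().lower() for o in outputs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(outputs)

# e.g. five samples of the same prompt at temperature 1.0:
rate = consistency(["Paris", "paris", "Paris", "Lyon", "Paris"])
```

A consistency rate that drops sharply as temperature rises is one practical signal of how sensitive a model's answers are to sampling noise.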