Learn how to benchmark large language models using multiple-choice QA, summarization, and scoring techniques like fuzzy matching, ROUGE, and semantic similarity. Compare GPT models across tasks and dive into internal evaluation with log probabilities and perplexity.
Overview
Syllabus
- Unit 1: Introduction to LLM Benchmarking & Basic QA Evaluation
- Loading and Exploring TriviaQA Dataset
- Text Normalization for Fair Comparisons
- Comparing Answers Beyond Surface Formatting
- Evaluating a Single LLM Response
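The normalization step in Unit 1 might look like the following minimal sketch. It assumes SQuAD-style rules (lowercasing, stripping punctuation and articles, collapsing whitespace); the function name is illustrative, not the course's actual API.

```python
import string

ARTICLES = {"a", "an", "the"}

def normalize_answer(text: str) -> str:
    """Reduce an answer to a canonical form for fair comparison."""
    text = text.lower()
    # Drop punctuation characters entirely.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Remove articles and collapse whitespace.
    tokens = [t for t in text.split() if t not in ARTICLES]
    return " ".join(tokens)

# Surface formatting no longer blocks an exact match:
print(normalize_answer("The Eiffel Tower!") == normalize_answer("eiffel tower"))  # → True
```

With both sides normalized, a simple equality check already catches many answers that differ only in casing, punctuation, or leading articles.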
- Unit 2: Prompting Styles: Zero-shot, One-shot, and Few-shot
  - Creating Your First Zero-Shot Prompt
  - Creating Your First One-Shot Prompt
- Making Accuracy Functions More Flexible
  - Implementing Few-Shot Prompting
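The three prompting styles in Unit 2 differ only in how many worked examples precede the real question. A rough sketch, with a hypothetical helper (the Q/A template is an assumption, not the course's exact format):

```python
def build_few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a prompt: k worked Q/A pairs, then the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]

# Zero-shot: examples=[]; one-shot: examples[:1]; few-shot: two or more.
print(build_few_shot_prompt("What is the largest planet?", examples))
```

Passing an empty example list yields a zero-shot prompt, a single pair yields one-shot, and anything more is few-shot, so one function covers all three styles.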
- Unit 3: Improving Evaluation: Fuzzy Answer Matching
- Building Your First Fuzzy Matcher
- Evaluating Model Responses with Fuzzy Matching
- Finding the Perfect Similarity Threshold
- Prompting Strategies Showdown with Fuzzy Matching
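The core idea of Unit 3 can be sketched with the standard library's `difflib` standing in for whatever similarity scorer the course uses: compute a similarity ratio between prediction and gold answer, and accept the answer when the ratio clears a tunable threshold.

```python
from difflib import SequenceMatcher

def fuzzy_match(prediction: str, gold: str, threshold: float = 0.8) -> bool:
    """Accept an answer when its similarity ratio meets the threshold."""
    ratio = SequenceMatcher(None, prediction.lower(), gold.lower()).ratio()
    return ratio >= threshold

# Near-miss answers pass; unrelated answers fail.
print(fuzzy_match("Barack Obama", "barack h. obama"))  # → True
print(fuzzy_match("Paris", "London"))                  # → False
```

The threshold is the knob the unit tunes: too low and wrong answers slip through, too high and legitimate variants get rejected, so it is usually swept over held-out examples.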
- Unit 4: Comparing GPT-3.5, GPT-4, and Davinci with Smart Scoring
  - Implementing Few-Shot Learning with GPT-4
- Threshold Impact on Model Evaluation
- Building a Model Performance Leaderboard
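The leaderboard in Unit 4 reduces to ranking models by their accuracy over the same question set. A minimal sketch, with hypothetical per-question correctness flags (the model names and numbers below are placeholders, not real results):

```python
def leaderboard(results: dict[str, list[bool]]) -> list[tuple[str, float]]:
    """Rank models by the fraction of questions they answered correctly."""
    scores = {model: sum(flags) / len(flags) for model, flags in results.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative correctness flags, one per benchmark question:
results = {
    "gpt-3.5-turbo": [True, True, False, True],
    "gpt-4":         [True, True, True, True],
    "davinci":       [True, False, False, True],
}

for model, accuracy in leaderboard(results):
    print(f"{model:14s} {accuracy:.2f}")
```

Because every model is scored on identical questions with the same matching rules, the resulting ranking reflects model quality rather than differences in the evaluation setup.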