Learn how to benchmark large language models using multiple-choice QA, summarization, and scoring techniques like fuzzy matching, ROUGE, and semantic similarity. Compare GPT models across tasks and dive into internal evaluation with log probabilities and perplexity.
Overview
Syllabus
- Unit 1: Introduction to LLM Benchmarking & Basic QA Evaluation
- Loading and Exploring TriviaQA Dataset
- Text Normalization for Fair Comparisons
- Comparing Answers Beyond Surface Formatting
- Evaluating a Single LLM Response
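The normalization step in Unit 1 might look like the following minimal sketch. It assumes SQuAD-style rules (lowercasing, stripping punctuation and articles, collapsing whitespace); the function name is illustrative, not the course's actual API.

```python
import string

ARTICLES = {"a", "an", "the"}

def normalize_answer(text: str) -> str:
    """Reduce an answer to a canonical form for fair comparison."""
    text = text.lower()
    # Drop punctuation characters entirely.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Remove articles and collapse whitespace.
    tokens = [t for t in text.split() if t not in ARTICLES]
    return " ".join(tokens)

# Surface formatting no longer blocks an exact match:
print(normalize_answer("The Eiffel Tower!") == normalize_answer("eiffel tower"))  # → True
```

With both sides normalized, a simple equality check already catches many answers that differ only in casing, punctuation, or leading articles.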
- Unit 2: Prompting Styles: Zero-shot, One-shot, and Few-shot
  - Creating Your First Zero-Shot Prompt
  - Creating Your First One-Shot Prompt
- Making Accuracy Functions More Flexible
  - Implementing Few-Shot Prompting
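The three prompting styles in Unit 2 differ only in how many worked examples precede the real question. A rough sketch, with a hypothetical helper (the Q/A template is an assumption, not the course's exact format):

```python
def build_few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a prompt: k worked Q/A pairs, then the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]

# Zero-shot: examples=[]; one-shot: examples[:1]; few-shot: two or more.
print(build_few_shot_prompt("What is the largest planet?", examples))
```

Passing an empty example list yields a zero-shot prompt, a single pair yields one-shot, and anything more is few-shot, so one function covers all three styles.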
- Unit 3: Improving Evaluation: Fuzzy Answer Matching
- Building Your First Fuzzy Matcher
- Evaluating Model Responses with Fuzzy Matching
- Finding the Perfect Similarity Threshold
- Prompting Strategies Showdown with Fuzzy Matching
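The core idea of Unit 3 can be sketched with the standard library's `difflib` standing in for whatever similarity scorer the course uses: compute a similarity ratio between prediction and gold answer, and accept the answer when the ratio clears a tunable threshold.

```python
from difflib import SequenceMatcher

def fuzzy_match(prediction: str, gold: str, threshold: float = 0.8) -> bool:
    """Accept an answer when its similarity ratio meets the threshold."""
    ratio = SequenceMatcher(None, prediction.lower(), gold.lower()).ratio()
    return ratio >= threshold

# Near-miss answers pass; unrelated answers fail.
print(fuzzy_match("Barack Obama", "barack h. obama"))  # → True
print(fuzzy_match("Paris", "London"))                  # → False
```

The threshold is the knob the unit tunes: too low and wrong answers slip through, too high and legitimate variants get rejected, so it is usually swept over held-out examples.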
- Unit 4: Comparing GPT-3.5, GPT-4, and Davinci with Smart Scoring
  - Implementing Few-Shot Learning with GPT-4
- Threshold Impact on Model Evaluation
- Building a Model Performance Leaderboard
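The leaderboard in Unit 4 reduces to ranking models by their accuracy over the same question set. A minimal sketch, with hypothetical per-question correctness flags (the model names and numbers below are placeholders, not real results):

```python
def leaderboard(results: dict[str, list[bool]]) -> list[tuple[str, float]]:
    """Rank models by the fraction of questions they answered correctly."""
    scores = {model: sum(flags) / len(flags) for model, flags in results.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative correctness flags, one per benchmark question:
results = {
    "gpt-3.5-turbo": [True, True, False, True],
    "gpt-4":         [True, True, True, True],
    "davinci":       [True, False, False, True],
}

for model, accuracy in leaderboard(results):
    print(f"{model:14s} {accuracy:.2f}")
```

Because every model is scored on identical questions with the same matching rules, the resulting ranking reflects model quality rather than differences in the evaluation setup.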