

Benchmarking LLMs with QA

via CodeSignal

Overview

Learn how to benchmark large language models using multiple-choice QA, summarization, and scoring techniques like fuzzy matching, ROUGE, and semantic similarity. Compare GPT models across tasks and dive into internal evaluation with log probabilities and perplexity.
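To make the scoring ideas above concrete, here is a minimal, self-contained sketch (not course code) of the kind of answer matching used in QA benchmarking: model answers are normalized (lowercased, punctuation and articles stripped) and then compared either exactly or with a fuzzy similarity ratio. It uses Python's standard-library difflib.SequenceMatcher as a stand-in for whichever similarity measure the course actually uses; the threshold value and the example strings are illustrative assumptions.

```python
import re
import string
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """Strict comparison after normalization (Unit 1 style)."""
    return normalize(prediction) == normalize(reference)

def fuzzy_match(prediction: str, reference: str, threshold: float = 0.8) -> bool:
    """Accept answers whose character-level similarity clears a threshold
    (Unit 3 style); 0.8 is an illustrative value, not the course's."""
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    return ratio >= threshold

print(exact_match("The Eiffel Tower", "eiffel tower"))   # True: formatting ignored
print(fuzzy_match("Eifel Tower", "Eiffel Tower"))        # True: survives the typo
```

Sweeping the threshold, as Unit 3's "Finding the Perfect Similarity Threshold" lesson suggests, trades credit for near-miss answers against the risk of accepting wrong ones.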

Syllabus

  • Unit 1: Introduction to LLM Benchmarking & Basic QA Evaluation
    • Loading and Exploring TriviaQA Dataset
    • Text Normalization for Fair Comparisons
    • Comparing Answers Beyond Surface Formatting
    • Evaluating a Single LLM Response
  • Unit 2: Prompting Styles: Zero-shot, One-shot, and Few-shot (see the sketch after this syllabus)
    • Creating Your First Zero-Shot Prompt
    • Creating Your First One-Shot Prompt
    • Making Accuracy Functions More Flexible
    • Implementing Few-Shot Prompting
  • Unit 3: Improving Evaluation: Fuzzy Answer Matching
    • Building Your First Fuzzy Matcher
    • Evaluating Model Responses with Fuzzy Matching
    • Finding the Perfect Similarity Threshold
    • Prompting Strategies Showdown with Fuzzy Matching
  • Unit 4: Comparing GPT-3.5, GPT-4, and Davinci with Smart Scoring
    • Implementing Few-Shot Learning with GPT-4
    • Threshold Impact on Model Evaluation
    • Building a Model Performance Leaderboard
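As a flavor of the prompting styles in Unit 2, the sketch below (an illustrative assumption, not course code) builds zero-, one-, and few-shot prompts by prepending worked question-answer exemplars before the target question; the exemplars and prompt wording are made up for the example.

```python
# Illustrative sketch of zero-/one-/few-shot prompt construction; exemplars are made up.
EXAMPLES = [
    ("Which planet is known as the Red Planet?", "Mars"),
    ("Who wrote 'Pride and Prejudice'?", "Jane Austen"),
]

def build_prompt(question: str, num_shots: int = 0) -> str:
    """Prepend up to `num_shots` Q/A exemplars before the target question."""
    lines = ["Answer the question as briefly as possible."]
    for q, a in EXAMPLES[:num_shots]:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

print(build_prompt("What is the capital of Australia?"))               # zero-shot
print(build_prompt("What is the capital of Australia?", num_shots=1))  # one-shot
print(build_prompt("What is the capital of Australia?", num_shots=2))  # few-shot
```

The same prompt builder can feed any of the models compared in Unit 4, with the resulting answers scored by matching functions like those sketched above.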
