Overview

In this course, you’ll experiment with deeper aspects of LLM evaluation: token usage efficiency, temperature sensitivity, model output consistency, and detecting hallucinations. Through lightweight API experiments, you’ll develop intuition for how models behave beyond accuracy scores.

Syllabus
- Unit 1: Measuring and Interpreting Token Usage in LLMs
  - Comparing Token Counts to Prompt and Answer Lengths
  - Exploring Prompt Length and Token Usage
  - Refactoring Token Usage for Cleaner Code
- Unit 2: Exploring Temperature Sensitivity in LLM Outputs
  - Comparing Low and High Temperature Outputs
  - Exploring the Temperature-Creativity Spectrum
  - Comparing Models at the Same Temperature
- Unit 3: Measuring Model Consistency Across Reruns
  - Refactoring for Cleaner Consistency Checks
  - Parameterizing Consistency Test Runs
  - Tracking Response Patterns with Frequency Analysis
- Unit 4: Using LLMs as Fact-Checkers for Hallucination Detection
  - Generating Answers with GPT Models
  - Building a Complete Fact-Checking Pipeline
  - Organizing Fact-Check Results for Clarity
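As a taste of the Unit 1 experiments, here is a minimal sketch of comparing prompt and answer lengths to token counts. The `fake_completion` helper and the 4-characters-per-token heuristic are illustrative assumptions, not the course's actual code; a real experiment would call an LLM client and read exact counts from the response's `usage` field.

```python
from collections import namedtuple

# Stand-in for the token accounting an LLM API returns alongside a response.
FakeUsage = namedtuple("FakeUsage", ["prompt_tokens", "completion_tokens"])

def fake_completion(prompt: str, answer: str) -> FakeUsage:
    # Rough proxy: ~4 characters per token, a common English-text rule of thumb.
    return FakeUsage(len(prompt) // 4, len(answer) // 4)

prompt = "Summarize the plot of Hamlet in one sentence."
answer = "A Danish prince avenges his father's murder at great cost."
usage = fake_completion(prompt, answer)

print(f"prompt chars={len(prompt)}, est. prompt tokens={usage.prompt_tokens}")
print(f"answer chars={len(answer)}, est. completion tokens={usage.completion_tokens}")
```

Comparing the character counts to the (estimated) token counts is the core of the unit's first lesson: tokens, not characters, are what you pay for and what fill the context window.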
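The consistency checks in Unit 3 boil down to rerunning one prompt many times and counting how often each answer appears. The sketch below shows that frequency-analysis pattern with `collections.Counter`; `ask_model` is a seeded stub standing in for a real API call, and the candidate answers are invented for illustration.

```python
import random
from collections import Counter

def ask_model(prompt: str, rng: random.Random) -> str:
    # Stub for a repeated model call: in a real run, the same prompt would be
    # sent N times at a fixed temperature. The prompt is ignored here.
    candidates = ["Paris", "Paris", "Paris", "Paris is the capital."]
    return rng.choice(candidates)

def consistency_report(prompt: str, runs: int = 20, seed: int = 0) -> Counter:
    rng = random.Random(seed)
    # Normalize before counting so trivial formatting differences collapse.
    answers = [ask_model(prompt, rng).strip().lower() for _ in range(runs)]
    return Counter(answers)

report = consistency_report("What is the capital of France?")
top_answer, count = report.most_common(1)[0]
print(f"most frequent answer: {top_answer!r} ({count}/{sum(report.values())} runs)")
```

Parameterizing `runs` and the seed, as the unit's lessons do, makes the experiment repeatable and lets you see how stability changes with sample size.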