Overview

This course explores benchmarking for open-ended generation tasks such as summarization. You'll experiment with different prompting styles, compare models such as GPT-3.5 and GPT-4, and evaluate the results with both surface-level overlap metrics (ROUGE) and semantic similarity computed from embeddings.

Syllabus
- Unit 1: Prompting for Summarization with LLMs (see Sketch 1 below)
  - Reading CSV Files with Python
  - Simplifying Prompts for Better Summaries
  - Crafting Custom Prompts for Better Summaries
- Unit 2: Scoring and Comparing Models with ROUGE (see Sketch 2 below)
  - Loading Data for ROUGE Evaluation
  - Setting Up ROUGE for Summary Evaluation
  - Evaluating Summaries with ROUGE Metrics
  - Evaluating GPT-4 Summarization with ROUGE-L
- Unit 3: Semantic Evaluation with Embeddings (see Sketch 3 below)
  - Implementing Cosine Similarity for Vector Comparison
  - Creating Text Embeddings with OpenAI
  - Building a Semantic Comparison Pipeline
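Sketch 1 (Unit 1): a minimal sketch of the unit's starting point, reading articles from a CSV and asking a chat model for summaries. The file name `articles.csv`, its `text` column, and the `gpt-3.5-turbo` model choice are illustrative assumptions, not part of the course materials.

```python
# Sketch 1: read articles from a CSV and request a summary for each.
# Assumes articles.csv has a "text" column and OPENAI_API_KEY is set
# in the environment; both are illustrative, not from the course.
import csv
from openai import OpenAI

client = OpenAI()

def summarize(article: str) -> str:
    # A deliberately simple prompt; the unit explores variations on it.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize the article in two sentences."},
            {"role": "user", "content": article},
        ],
    )
    return response.choices[0].message.content

with open("articles.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(summarize(row["text"]))
```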
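Sketch 2 (Unit 2): scoring a candidate summary against a reference with the `rouge-score` package, which reports ROUGE-1, ROUGE-2, and ROUGE-L, including the ROUGE-L variant used to evaluate GPT-4 in the final lesson of the unit. The example sentences are placeholders.

```python
# Sketch 2: compare a candidate summary to a reference with ROUGE.
# Uses the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The committee approved the budget after a short debate."
candidate = "After a brief debate, the committee passed the budget."

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    # Each result carries precision, recall, and F1 for that ROUGE variant.
    print(f"{name}: precision={result.precision:.3f} "
          f"recall={result.recall:.3f} f1={result.fmeasure:.3f}")
```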
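Sketch 3 (Unit 3): a hand-rolled cosine similarity applied to OpenAI embeddings, mirroring the unit's pipeline of embedding two texts and comparing the resulting vectors. The embedding model name `text-embedding-3-small` and the sample strings are assumptions.

```python
# Sketch 3: embed two summaries and compare them with cosine similarity.
import math
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|), implemented without numpy for clarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def embed(text: str) -> list[float]:
    # Model choice here is an assumption, not specified by the course.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

reference = "The committee approved the budget after a short debate."
candidate = "After a brief debate, the committee passed the budget."

# Values near 1.0 indicate the two summaries are semantically close.
print(cosine_similarity(embed(reference), embed(candidate)))
```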