
METR's Benchmarks vs Economics - The AI Capability Measurement Gap

AI Engineer via YouTube

Overview

Explore the disconnect between AI benchmark performance and real-world productivity in this 21-minute conference talk that examines why impressive lab results don't always translate to practical impact. Discover METR's innovative methodology for measuring AI capabilities through human time horizons rather than traditional benchmark scores, and learn how this approach reveals a more nuanced picture of AI performance. Analyze empirical results comparing models like Claude 3 Opus and o1-preview, while understanding why high-performing AI systems in controlled environments may not significantly speed up experienced developers' work in field studies. Examine the limitations of current benchmarking approaches, including saturation issues and interpretation challenges, before diving into METR's capability curve fitting methodology. Investigate findings from randomized controlled trials that highlight the gap between laboratory measurements and real-world developer productivity, exploring potential explanations including reliability requirements, task distribution differences, and capability elicitation challenges. Consider the implications for automated AI research and development, while gaining insights into context dependency, task interdependence, and the complex factors that influence AI's practical utility in professional software development environments.
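
The talk itself presents no code, but METR's published time-horizon approach can be illustrated with a minimal sketch: fit a logistic curve relating each task's human completion time to the model's pass/fail outcome, then read off the task length at which predicted success falls to 50%. The data below is invented for illustration and is not from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: human baseline time (in minutes) for each task,
# and whether the evaluated model solved it (1) or not (0).
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
model_solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Model success probability as a logistic function of log task length.
# A large C approximates an unregularized maximum-likelihood fit.
X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, model_solved)

# The 50% time horizon is where the decision function crosses zero:
# intercept + coef * log(t) = 0  =>  t = exp(-intercept / coef)
h50 = np.exp(-clf.intercept_[0] / clf.coef_[0, 0])
print(f"Estimated 50% time horizon: {h50:.0f} human-minutes")
```

On real evaluation data the same curve is fit per model, and the talk's Claude 3 Opus vs. o1-preview comparison tracks how this 50% horizon has grown across model generations.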

Syllabus

Introduction to METR & The Capability Gap
The Problem with Current Benchmarks: Saturation & Interpretation
METR’s New Methodology: Human Time Horizons
Empirical Results: Fitting Capability Curves
Time Horizon Trends: Claude 3 Opus vs. o1-preview
Randomized Controlled Trial (RCT) Discussion
Reconciling the Gap: Why High Benchmarks Don't Mean High Productivity
Explaining the Discrepancy: Context, Reliability, and Task Interdependence
Future Work & Hiring at METR

Taught by

AI Engineer

