Overview
Explore the evolution of AI coding capabilities in this conference talk, which traces the journey from simple code snippet generation to complete codebase development with agentic workflows. Learn about early testable coding benchmarks and discover key insights about contamination and distributional overfitting challenges.

Examine repository-grounded coding problems, including SWE-bench-style bug fixing and R2E's automated function completion approach. Dive into longer-horizon programming tasks such as runtime optimization through GSO, code translation via Syzygy, and refactoring, while understanding critical challenges like test hacking, code quality assessment, and maintaining code idiomaticity.

Discover evaluation methodologies that extend beyond code generation to include human preference evaluation in conversational coding through LMArena RepoChat, as well as developer preference signals captured in-IDE via Copilot Arena. Gain practical insights from Cursor's engineering perspective on building and evaluating AI coding systems that can handle increasingly complex software development tasks.
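The "testable" benchmarks mentioned above share a common core: run the model's generated code against hidden unit tests and score the fraction of samples that pass. A minimal sketch of that idea, with an entirely illustrative task and function names (this is not the harness of any specific benchmark discussed in the talk):

```python
# Toy execution-based coding eval: run each generated sample against
# unit tests and report pass@1 (fraction of samples passing all tests).
# The task, entry-point name "add", and test cases are all hypothetical.

def run_candidate(candidate_src: str, tests: list) -> bool:
    """Execute generated source, then check each (args, expected) pair."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # load the model's function definition
        fn = namespace["add"]           # assumed entry point for this toy task
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                    # any crash or wrong name is a failure

def pass_at_1(candidates: list, tests: list) -> float:
    """Fraction of independent samples that pass every unit test."""
    results = [run_candidate(src, tests) for src in candidates]
    return sum(results) / len(results)

tests = [((1, 2), 3), ((-1, 1), 0)]
samples = [
    "def add(a, b):\n    return a + b",  # correct sample
    "def add(a, b):\n    return a - b",  # buggy sample
]
print(pass_at_1(samples, tests))  # 0.5
```

This also hints at the test-hacking problem the talk raises: if the test suite is weak (say, only one case), a sample that hardcodes the expected output passes despite being wrong, which is why repository-grounded benchmarks invest heavily in test quality.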
Syllabus
Coding Evals: From Code Snippets to Codebases – Naman Jain, Cursor
Taught by
AI Engineer