Overview
Watch a technical colloquium talk exploring how compression-based metrics can measure the quality of mechanistic interpretability in AI models. Delve into research findings from studying small transformers trained on Max-of-K tasks, where 102 different computer-assisted proof strategies were developed to assess proof length and bound tightness across 151 models. Learn how shorter proofs correlate with better mechanistic understanding and tighter performance bounds, while examining the challenge of compounding structureless noise in generating compact proofs. Discover ongoing work in relaxing worst-case constraints and fine-tuning partially-interpreted models, along with a roadmap for scaling this approach to frontier models. Explore key concepts including theorem statements, baseline approaches, toy cases, distilling neural networks, and modular arithmetic models through detailed technical discussions and Q&A sessions.
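To make the setting concrete, here is a minimal sketch (not the talk's actual code; all names are illustrative) of the Max-of-K task and the brute-force baseline the syllabus refers to: a model must output the maximum of K input tokens, and a "brute force proof" of its accuracy simply enumerates every possible input, so the proof length grows as vocab_size**K.

```python
import itertools

def max_of_k_label(tokens):
    # Ground-truth label for the Max-of-K task: the largest input token.
    return max(tokens)

def brute_force_accuracy(model, vocab_size, k):
    """Certify accuracy by checking all vocab_size**k inputs.

    The 'proof length' of this baseline is the number of cases checked,
    which is what shorter, mechanistically-informed proofs try to beat.
    """
    inputs = list(itertools.product(range(vocab_size), repeat=k))
    correct = sum(model(x) == max_of_k_label(x) for x in inputs)
    return correct / len(inputs), len(inputs)

# Toy stand-in "model" that happens to be exactly correct:
acc, proof_len = brute_force_accuracy(lambda x: max(x), vocab_size=4, k=3)
```

Even at this toy scale the brute-force proof checks 64 cases; the talk's compression-based metric asks how much mechanistic understanding can shrink that count while keeping the accuracy bound tight.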
Syllabus
Introduction
Why Metrics
Theorem Statement
Baseline Approach
Brute Force Approach
Mechanistic Understanding
Toy Case
Current Applications
Distilling Neural Networks
Compressing Proofs
Research Agenda
Q&A
Group Approach
Model
Brute Force
Insight
Error Term Matrix
Modular Arithmetic Model
Taught by
Topos Institute