GPT-OSS Has a Hidden Confidence Switch - Inside GPT-OSS

Explore the hidden confidence routing mechanism in GPT models through a mechanistic interpretability analysis of Layer 15. Discover how the model makes strategic decisions about problem difficulty, operation type, and retrieval confidence before it performs any actual computation or retrieval. Learn to identify and manipulate the specific neurons that separate mathematical tasks from language tasks, select between operations, and encode a gradient of problem difficulty. Master techniques for finding, ablating, and steering these neurons to demonstrate how Layer 15's confidence routing affects downstream Layers 19-21, revealing that the confidence displayed in the model's output reflects internal routing decisions rather than post-computation verification.

Syllabus

- Introduction
- Overview of Layers
- Layer 15: Overview
- Confidence of Facts
- Uncertainty
- The Signal Dictionary of Layer 15
- Finding the Neurons
- Ablating Neurons
- Steering Neurons
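
For readers new to the last three syllabus topics, the general idea of finding, ablating, and steering a neuron can be sketched on a toy network. This is not GPT-OSS and not the course's own code: the weights, the input groups, and the helper names (`find_neurons`, `ablate`, `steer`) are all hypothetical, and "finding" is approximated here by a simple difference in mean activation between two groups of inputs.

```python
# Toy illustration only: a hand-built two-layer network, not GPT-OSS.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical weights: 3 inputs -> 4 hidden neurons -> 2 outputs.
W_IN = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
W_OUT = [
    [2.0, 0.0, 0.5, 0.0],
    [0.0, 2.0, 0.5, 0.0],
]

def forward(x, intervene=None):
    """Return (hidden, output); `intervene` may rewrite the hidden layer."""
    hidden = relu(matvec(W_IN, x))
    if intervene is not None:
        hidden = intervene(hidden)
    return hidden, matvec(W_OUT, hidden)

def find_neurons(group_a, group_b):
    """Rank hidden neurons by |mean activation on A - mean activation on B|."""
    def mean_acts(group):
        acts = [forward(x)[0] for x in group]
        return [sum(col) / len(col) for col in zip(*acts)]
    mean_a, mean_b = mean_acts(group_a), mean_acts(group_b)
    gaps = [abs(a - b) for a, b in zip(mean_a, mean_b)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)

def ablate(index):
    """Intervention that zeroes one hidden neuron."""
    def hook(hidden):
        out = list(hidden)
        out[index] = 0.0
        return out
    return hook

def steer(index, delta):
    """Intervention that adds a fixed offset to one hidden neuron."""
    def hook(hidden):
        out = list(hidden)
        out[index] += delta
        return out
    return hook

# Two made-up input groups standing in for two task types.
group_a = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.2]]
group_b = [[0.0, 0.5, 0.2], [0.0, 0.5, 0.2]]

top = find_neurons(group_a, group_b)[0]   # most group-selective neuron
x = group_a[0]
_, baseline = forward(x)
_, ablated = forward(x, ablate(top))
_, steered = forward(x, steer(top, 1.0))
print(top, baseline, ablated, steered)
```

Comparing `baseline` with `ablated` and `steered` shows how a change to a single upstream neuron propagates to the output. In a real transformer the same pattern would typically be applied through activation hooks on a chosen layer rather than through a hand-written forward pass.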