Overview
This lecture by Azalia Mirhoseini from Stanford/DeepMind explores inference compute as an emerging frontier for scaling Large Language Models (LLMs). Discover how the "Large Language Monkeys" research demonstrates a predictable log-linear relationship between coverage (the fraction of problems solved by at least one sample) and the number of inference samples across four orders of magnitude, suggesting the existence of inference-time scaling laws. Learn how these coverage gains translate into improved performance in domains with automatic verification, such as coding and formal proofs, and why identifying correct samples without a verifier remains challenging. Explore the Archon framework, which automatically designs effective inference-time systems by selecting, combining, and stacking operations such as repeated sampling, fusion, ranking, and verification to optimize LLM performance across diverse tasks. The talk concludes with hardware acceleration techniques for improving computational efficiency in LLM serving.
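Coverage of the kind discussed in the lecture is commonly estimated with the unbiased pass@k estimator: given n samples per problem of which c pass an automatic verifier, it gives the probability that at least one of k randomly chosen samples is correct. A minimal sketch (the function name and the numerically stable product form are illustrative, not taken from the talk):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least 1 of k sampled completions is correct),
    given n total samples of which c are verified correct.
    Equivalent to 1 - C(n-c, k) / C(n, k), written as a product to avoid
    huge intermediate binomial coefficients."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Sweeping k for fixed per-sample success rate shows how coverage
# grows with the number of inference samples drawn.
for k in (1, 10, 100):
    print(k, pass_at_k(1000, 20, k))
```

Plotting such coverage values against k on a log scale is how the log-linear trend described above is typically visualized.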
Syllabus
Inference Scaling: A New Frontier for AI Capability
Taught by
Simons Institute