Quantifying Generalization Complexity for Large Language Models
Massachusetts Institute of Technology via YouTube
Overview
This 15-minute conference talk by Hongyin Luo from MIT CSAIL introduces SCYLLA, an evaluation framework designed to disentangle generalization from memorization in Large Language Models (LLMs). Explore how this dynamic evaluation approach quantifies LLMs' generalization capabilities across task-complexity levels, revealing how the performance gap between in-distribution and out-of-distribution data varies with complexity. Learn about the "generalization valley" phenomenon: a non-monotonic relationship between task complexity and performance that marks where LLMs begin to rely heavily on non-generalizable, memorized behavior. Discover how this critical complexity threshold shifts as model size increases, suggesting that larger models can handle more complex reasoning tasks before defaulting to memorization. The presentation covers benchmarking results across 28 popular LLMs, including open-source models such as LLaMA and Qwen and closed models such as Claude and GPT, providing valuable insights for researchers and practitioners interested in robust evaluation methods for language models.
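To make the core idea concrete, here is a minimal sketch of how the per-complexity gap between in-distribution (ID) and out-of-distribution (OOD) accuracy could be computed. This is not the official SCYLLA implementation; the data layout and function names are illustrative assumptions, and the toy numbers simply exhibit a "valley"-shaped gap curve peaking at an intermediate complexity level.

```python
# Illustrative sketch (not the SCYLLA codebase): measure the ID-OOD
# accuracy gap at each task-complexity level. A peak in the gap curve
# corresponds to the "generalization valley" described in the talk.
from statistics import mean

def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return mean(1.0 if r else 0.0 for r in results)

def generalization_gaps(id_results, ood_results):
    """Return ID accuracy, OOD accuracy, and their gap per complexity level.

    id_results / ood_results: dict mapping complexity level -> list of
    per-example booleans (True = model answered correctly).
    """
    gaps = {}
    for level in sorted(id_results):
        acc_id = accuracy(id_results[level])
        acc_ood = accuracy(ood_results[level])
        gaps[level] = {"id": acc_id, "ood": acc_ood, "gap": acc_id - acc_ood}
    return gaps

# Toy data: the gap peaks at level 2 (the critical complexity), where the
# model's ID performance is propped up by memorization rather than
# generalizable skill.
id_results = {
    1: [True] * 9 + [False],         # 90% ID accuracy
    2: [True] * 9 + [False],         # 90%
    3: [True] * 8 + [False] * 2,     # 80%
}
ood_results = {
    1: [True] * 9 + [False],         # 90% OOD: little memorization
    2: [True] * 6 + [False] * 4,     # 60% OOD: gap peaks -> the valley
    3: [True] * 7 + [False] * 3,     # 70%
}

for level, s in generalization_gaps(id_results, ood_results).items():
    print(f"complexity {level}: ID={s['id']:.2f} "
          f"OOD={s['ood']:.2f} gap={s['gap']:.2f}")
```

Under this framing, the finding that the critical threshold shifts with model size would appear as the gap curve's peak moving toward higher complexity levels for larger models.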
Syllabus
Hongyin Luo - Quantifying Generalization Complexity for Large Language Models
Taught by
MIT Embodied Intelligence