Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Center for Language & Speech Processing (CLSP), JHU via YouTube
Overview
Watch a 14-minute award-winning conference presentation from Johns Hopkins University's Center for Language & Speech Processing that explores the complexities of knowledge cutoff dates in Large Language Models (LLMs). Dive into the critical distinction between reported and effective cutoff dates for training data, and understand why this matters for applications requiring current information. Learn about a novel approach to estimate effective cutoffs at the resource level by probing across different data versions, without needing access to pre-training data. Discover key findings that reveal significant discrepancies between reported and effective cutoffs, attributed to temporal misalignments in CommonCrawl data and complications in LLM deduplication schemes. Gain valuable insights into why cutoff dates are more nuanced than previously thought, and understand the implications for both LLM dataset curators and practitioners implementing these models.
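The resource-level probing idea described above can be illustrated with a minimal sketch. The premise: if a model was effectively trained on a particular dated version of a resource (say, a Wikipedia page snapshot), it should assign that version a higher likelihood than earlier or later versions. The function below is a hypothetical illustration, not the paper's actual implementation; the version dates and scores are invented for the example.

```python
from datetime import date

def estimate_effective_cutoff(version_scores):
    """Given a mapping from document-version date to the model's mean
    log-likelihood on that version, return the date of the version the
    model 'prefers' -- a rough proxy for the effective cutoff for that
    resource."""
    return max(version_scores, key=version_scores.get)

# Hypothetical per-version log-likelihoods for one resource:
scores = {
    date(2022, 3, 1): -2.41,
    date(2022, 9, 1): -2.18,  # model assigns highest likelihood here
    date(2023, 3, 1): -2.55,
}
print(estimate_effective_cutoff(scores))  # -> 2022-09-01
```

Aggregating such per-resource estimates across many documents is what lets the effective cutoff be compared against the reported one without access to the pre-training data itself.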
Syllabus
Dated Data: Tracing Knowledge Cutoffs in Large Language Models (COLM 2024 Outstanding Paper Award)
Taught by
Center for Language & Speech Processing (CLSP), JHU