Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Center for Language & Speech Processing (CLSP), JHU via YouTube
Overview
Watch a 14-minute award-winning conference presentation from Johns Hopkins University's Center for Language & Speech Processing that explores the complexities of knowledge cutoff dates in Large Language Models (LLMs). Dive into the critical distinction between reported and effective cutoff dates for training data, and understand why this matters for applications requiring current information. Learn about a novel approach to estimate effective cutoffs at the resource level by probing across different data versions, without needing access to pre-training data. Discover key findings that reveal significant discrepancies between reported and effective cutoffs, attributed to temporal misalignments in CommonCrawl data and complications in LLM deduplication schemes. Gain valuable insights into why cutoff dates are more nuanced than previously thought, and understand the implications for both LLM dataset curators and practitioners implementing these models.
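The resource-level probing idea described above can be illustrated with a minimal sketch. The premise: if a model was effectively trained on a particular dated version of a resource (say, a Wikipedia page snapshot), it should assign that version a higher likelihood than earlier or later versions. The function below is a hypothetical illustration, not the paper's actual implementation; the version dates and scores are invented for the example.

```python
from datetime import date

def estimate_effective_cutoff(version_scores):
    """Given a mapping from document-version date to the model's mean
    log-likelihood on that version, return the date of the version the
    model 'prefers' -- a rough proxy for the effective cutoff for that
    resource."""
    return max(version_scores, key=version_scores.get)

# Hypothetical per-version log-likelihoods for one resource:
scores = {
    date(2022, 3, 1): -2.41,
    date(2022, 9, 1): -2.18,  # model assigns highest likelihood here
    date(2023, 3, 1): -2.55,
}
print(estimate_effective_cutoff(scores))  # -> 2022-09-01
```

Aggregating such per-resource estimates across many documents is what lets the effective cutoff be compared against the reported one without access to the pre-training data itself.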
Syllabus
Dated Data: Tracing Knowledge Cutoffs in Large Language Models (COLM 2024 Outstanding Paper Award)
Taught by
Center for Language & Speech Processing (CLSP), JHU