The TinyStories Dataset - Training Small Language Models for Coherent Text Generation
Harvard CMSA via YouTube
Overview
Explore a seminar presentation by Ronen Eldan of Microsoft Research, given in Harvard CMSA's New Technologies in Mathematics series, that asks how small a language model can be and still generate coherent text. Learn about TinyStories, a synthetic dataset of children's stories generated by GPT-3.5 and GPT-4 using only vocabulary that 3- to 4-year-olds can understand. See how this dataset makes it possible to train very small language models (under 10 million parameters) that still produce fluent, grammatical, and consistent multi-paragraph stories. Discover how these small models demonstrate reasoning capabilities while remaining interpretable, since their attention and activation patterns can be visualized. Understand what this research implies for building efficient language models for low-resource environments and specialized domains, and gain insight into how language capabilities emerge in these systems.
Syllabus
Ronen Eldan | The TinyStories Dataset: How Small Can Language Models Be And Still Speak Coherent
Taught by
Harvard CMSA