Overview
Explore a thought-provoking lecture that delves into the limitations of next-token prediction in modeling human intelligence. Examine the critical distinction between autoregressive inference and teacher-forced training in language models. Discover why the popular criticism of error compounding during autoregressive inference may overlook a more fundamental issue: the potential failure of teacher-forcing to learn accurate next-token predictors for certain task classes. Investigate a general mechanism of teacher-forcing failure and analyze empirical evidence from a minimal planning task where both Transformer and Mamba architectures struggle. Consider the potential benefits of training models to predict multiple tokens in advance as a possible solution. Gain insights that can inform future debates and inspire research beyond the current next-token prediction paradigm in artificial intelligence.
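The contrast the lecture draws between teacher-forced training and autoregressive inference can be illustrated with a minimal sketch. The toy bigram lookup table below is a hypothetical stand-in for a trained model, not the lecture's actual setup; it is seeded with one wrong entry to show how a single error stays isolated under teacher forcing but compounds under autoregressive generation.

```python
# Toy "model": a bigram lookup table mapping each token to a predicted
# next token. The entry for "the" is deliberately wrong ("dog" instead
# of "cat") to illustrate error propagation.
TABLE = {"the": "dog", "dog": "barked", "cat": "sat", "sat": "down"}

def model_next(token, table):
    """Predict the next token from the current one (toy bigram model)."""
    return table.get(token, "<eos>")

def teacher_forced_predictions(ground_truth, table):
    # Training-style evaluation: each prediction is conditioned on the
    # TRUE prefix, so one early mistake cannot contaminate later steps.
    return [model_next(tok, table) for tok in ground_truth[:-1]]

def autoregressive_generate(start, steps, table):
    # Inference: the model consumes its OWN previous output, so an
    # early error propagates to every subsequent token.
    seq = [start]
    for _ in range(steps):
        seq.append(model_next(seq[-1], table))
    return seq

truth = ["the", "cat", "sat", "down"]
print(teacher_forced_predictions(truth, TABLE))
# -> ['dog', 'sat', 'down']  (one isolated error)
print(autoregressive_generate("the", 3, TABLE))
# -> ['the', 'dog', 'barked', '<eos>']  (error compounds)
```

The lecture's point is that this well-known compounding critique may be secondary: even the teacher-forced predictions can fail to be learnable for certain planning tasks, regardless of how inference is run.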
Syllabus
The Pitfalls of Next-token Prediction
Taught by
Simons Institute