Evaluating Language Models for Mathematics Through Interactive Problem-Solving

Watch a Harvard CMSA seminar presentation in which Katherine Collins and Albert Jiang of the University of Cambridge discuss their research on evaluating large language models (LLMs) for mathematical problem-solving through interactive assessment. Explore the development of CheckMate, a prototype platform designed to facilitate human-LLM interaction and its evaluation in mathematical contexts. Learn about their comparative study of InstructGPT, ChatGPT, and GPT-4 as mathematical proof assistants, with participants ranging from undergraduate students to mathematics professors. Discover key insights from their MathConverse dataset, including a taxonomy of human behaviors and the relationship between correctness and perceived helpfulness in LLM responses. Examine the practical applications and limitations of LLMs in mathematical reasoning, with particular attention to GPT-4's capabilities as analyzed through case studies by expert mathematicians. Understand the takeaways for both machine learning practitioners and mathematicians, including the benefits of models that communicate uncertainty effectively, respond to corrections, and remain interpretable and concise.