MoNaCo - Natural Questions for Deep Reasoning Across Dozens of Documents

Explore a research presentation introducing MoNaCo, a benchmark designed to evaluate the question-answering capabilities of large language models on complex, multi-document reasoning tasks. Learn about the development of 1,315 challenging information-seeking questions that require synthesizing and reasoning across dozens of Wikipedia tables and passages, addressing a critical gap in current evaluation methodologies. Discover the performance results of 15 frontier LLMs, including GPT-5, o3, Claude Opus 4, Gemini 2.5 Pro, and DeepSeek-R1, with the top-performing model earning a perfect score on only 38.7% of questions. Understand how this benchmark reveals that factuality remains a significant challenge for LLMs even though many existing factual QA benchmarks are saturated, and gain insights into the limitations of current AI systems on real-world information synthesis problems similar to those tackled by tools like Deep Research.