Explore the development and evolution of large language model evaluation through Weights & Biases' Nejumi LLM Leaderboard in this 20-minute conference talk. Discover how W&B has systematically evaluated LLM performance since 2023, continuously publishing results on a leaderboard that has become Japan's largest evaluation platform and a key reference for researchers and companies. Learn about the iterative development process from the initial version through the latest version 4, and how the leaderboard has adapted to advances in evaluation techniques and model design. Gain insights from actual operational experience and explore future prospects for LLM evaluation methodologies and benchmarking standards in the rapidly evolving field of artificial intelligence.