Training Agentic Reasoners - Reinforcement Learning for Multi-Turn Tool Calling

Explore reinforcement learning (RL) techniques for training agentic reasoning systems in this technical conference talk. It examines multi-turn tool-calling approaches similar to those behind OpenAI's o3 and Deep Research. Learn when, why, and how to apply RL to agentic reasoning, and how GRPO compares with PPO and other methods. The talk covers how to design effective environments and rewards for training agents, reviews recent research highlights, and walks through results on example tasks. It also addresses the practical complexities and challenges of RL, as well as the connection between popular AI products and RL fine-tuning. Further topics include why tools and real-world tasks matter for agents, the critical problem of "reward hacking," and strategies for designing better evaluations. The presentation closes with future directions for agentic systems and a practical toolkit for implementation, including an overview of the open-source ecosystem (libraries, compute requirements, and tradeoffs). Throughout, it explains the core reinforcement learning loop and shows how reasoning and agents rest on similar underlying principles, offering both theoretical foundations and practical guidance for building sophisticated AI systems.

Syllabus

[00:00] Introduction to the idea that reasoning and agents are similar.
[01:05] The growing effectiveness of Reinforcement Learning (RL) in AI.
[03:04] The complexities and challenges of implementing RL.
[04:41] The connection between popular AI products (agents) and RL fine-tuning.
[07:18] The core process of Reinforcement Learning.
[10:21] The importance of tools and real-world tasks for agents.
[12:13] The problem of "reward hacking" and how to design better evaluations.
[14:51] Future directions for agentic systems and a practical toolkit for implementation.