Training Small Language Models to Reason with Reinforcement Learning - GRPO from Scratch

Learn to implement Group Relative Policy Optimization (GRPO) from scratch in PyTorch and use it to train Small Language Models (SLMs) on reasoning tasks with reinforcement learning. Explore the theory behind the policy gradient equations and the PPO loss function while building a complete training pipeline. Set up a reinforcement learning environment with the Reasoning Gym library and see how rewards are assigned for mathematical reasoning tasks. Code response generation, reward calculation, and advantage estimation for policy optimization. Implement log probability calculations and build the full RL training loop, then visualize how the model's log probabilities change after training. Work through the GRPO and PPO loss functions, including the surrogate clipping mechanism that keeps training stable. Apply supervised fine-tuning with LoRA (Low-Rank Adaptation) for parameter-efficient model updates. Examine practical results from reasoning-capable SLMs and review 10 practical tips for fine-tuning reasoning models. Key research papers referenced include DeepSeekMath, DeepSeek-R1, and DAPO, along with critical perspectives on reasoning model development, for context on current approaches in the field.
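As a rough illustration of the advantage estimation step mentioned above, the sketch below computes group-relative advantages the way GRPO is commonly described: the rewards of several responses sampled for the same prompt are normalized by the group's mean and standard deviation. It is written in plain Python rather than PyTorch so it runs without dependencies. The function name `group_relative_advantages`, the `eps` parameter, and the choice of population standard deviation are assumptions made for illustration, not details taken from the course.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    (group std + eps), so responses are scored relative to their siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt; two were scored as correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response in a group gets the same reward, all advantages are zero
# and the group contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Above-average responses receive positive advantages and below-average ones negative, which is what lets this style of training work without a separate value model.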

Syllabus

0:00 - Thinking LLMs are taking over!
3:47 - Setting up Reinforcement Learning Environment
4:50 - Reasoning Gym library - Rewards
8:00 - GRPO Visually explained
10:41 - Policy Optimization and PPO loss Explained
15:45 - Coding response generation
20:55 - Coding Reward Generation & Advantages
26:25 - Calculating log probabilities
30:58 - RL Training loop
33:49 - Visualizing log probabilities post training
36:01 - The GRPO and PPO Loss function
38:19 - Surrogate clipping
41:21 - Supervised Fine-Tuning and LoRA training
43:26 - Reasoning SLM results!
45:36 - 10 Practical Tips for Fine-Tuning Reasoning SLMs
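
For readers who want a preview of the "Surrogate clipping" segment listed in the syllabus, here is a minimal sketch of a PPO-style clipped surrogate loss for a single response. It assumes per-token log probabilities under the current and old policies and one shared advantage for the whole response. The function name `clipped_surrogate_loss` and the default `clip_eps=0.2` are hypothetical. The sketch leaves out the KL penalty toward a reference model that GRPO formulations often include, and it uses plain Python instead of PyTorch.

```python
import math


def clipped_surrogate_loss(new_logprobs, old_logprobs, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate loss for one response, averaged over tokens.

    Hypothetical helper: `new_logprobs` and `old_logprobs` are per-token log
    probabilities of the sampled tokens under the current and old policies.
    """
    total = 0.0
    for new_lp, old_lp in zip(new_logprobs, old_logprobs):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped objectives.
        total += min(ratio * advantage, clipped * advantage)
    return -total / len(new_logprobs)  # negate: optimizers minimize


# The token became twice as likely under the new policy.
old = [math.log(0.25)]
new = [math.log(0.5)]
loss_pos = clipped_surrogate_loss(new, old, advantage=1.0)   # gain capped at 1 + clip_eps
loss_neg = clipped_surrogate_loss(new, old, advantage=-1.0)  # penalty is not capped
```

With a positive advantage the objective stops rewarding ratio increases beyond `1 + clip_eps`, while with a negative advantage the full penalty still applies; that asymmetry is what limits the size of each policy update.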