Group Relative Policy Optimization (GRPO) - Formula and Implementation Tutorial
Yacine Mahdid via YouTube
Overview
Learn about Group Relative Policy Optimization (GRPO), a key algorithm powering the DeepSeek R1 architecture, through a detailed tutorial that breaks down both the mathematical formulas and practical implementation. Explore the differences between PPO and GRPO algorithms, understand their respective formulas, and follow along with a comprehensive code walkthrough featuring the HuggingFace post-training team's implementation. Dive into detailed explanations spanning from theoretical foundations to practical pseudo-code and actual trainer code implementation. Access additional resources including HuggingFace documentation, GitHub repositories, the DeepSeek Math paper, and complementary tutorials to deepen your understanding of GRPO and PPO concepts. Perfect for machine learning practitioners and researchers interested in advanced optimization techniques in AI model training.
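For reference while following the formula sections, the GRPO objective as presented in the DeepSeek Math paper looks roughly like the following (notation here is a paraphrase and may differ slightly from the video's slides):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\Bigg[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
    \Big( \min\big( r_{i,t}(\theta)\,\hat{A}_{i,t},\;
      \operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t} \big)
    - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \Bigg],
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q,\, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}
```

The key contrast with PPO is in the advantage term: PPO estimates it with a learned value network, while GRPO normalizes each completion's reward against the group of G completions sampled for the same prompt q.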
Syllabus
- Introduction: 0:00
- PPO vs GRPO: 1:18
- PPO formula overview: 4:24
- GRPO formula overview: 7:49
- GRPO pseudo-code: 11:11
- GRPO Trainer code: 13:21
- Conclusion: 23:48
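The pseudo-code and trainer-code sections revolve around the group-relative advantage computation. A minimal sketch of that step is below; this is an illustration based on the DeepSeek Math paper's description, not a copy of the HuggingFace trainer, whose implementation details may differ:

```python
import statistics

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one prompt's group of completions.

    Each reward is normalized against the group's mean and standard
    deviation, replacing PPO's learned value network (critic) -- this is
    the central simplification GRPO makes.
    """
    mean = statistics.mean(rewards)
    # Guard: a single-completion group has no spread to normalize by.
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    if std == 0.0:
        # All rewards identical: every completion is average, advantage 0.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

In a full training step, these per-completion advantages would be broadcast to every token of the corresponding completion and fed into the clipped policy-gradient loss, with a KL penalty against a frozen reference model.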
Taught by
Yacine Mahdid