GRPO - Group Relative Policy Optimization: How DeepSeek Trains Reasoning Models

Explore Group Relative Policy Optimization (GRPO), the reinforcement learning technique DeepSeek uses to train its reasoning model, in this 22-minute educational video from Serrano.Academy. The video explains how GRPO differs from self-supervised learning by using reinforcement learning so that the model improves itself. It compares DeepSeek's reasoning with ChatGPT's through practical examples, walks through how the GRPO score is computed, and covers the key concepts behind it: answering with context, the quality advantage of a response, response probability calculations, and clipping. Part of a broader series on reinforcement learning for large language models, it breaks the technical material into accessible explanations and suits anyone interested in the foundations of current AI reasoning systems.

Syllabus

00:00 Introduction
00:26 Answering with context
01:40 DeepSeek vs ChatGPT
05:30 The GRPO score
07:05 Averaging over answers and steps
07:38 Quality Advantage
10:30 Probability of responses
15:36 Clipping the response
18:21 Not changing the model too much
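
For readers who want to see how the syllabus topics fit together before watching, here is a minimal Python sketch of a GRPO-style score. It is not taken from the video. It follows the commonly published form of GRPO, and every function name, the `eps` and `beta` values, and the sample numbers are assumptions chosen for illustration. Each response is scored relative to its group (the quality advantage), the probability ratio between the new and old model is clipped, a penalty discourages drifting far from a reference model, and the result is averaged over steps and then over answers.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Quality advantage: each reward relative to its group, (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored the same, so none is better than the group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """Probability ratio times advantage, with the ratio clipped to [1-eps, 1+eps]."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped * advantage)


def kl_penalty(logp_new, logp_ref):
    """Non-negative estimate of how far the new model is from the reference."""
    d = logp_ref - logp_new
    return math.exp(d) - d - 1.0


def grpo_score(new, old, ref, rewards, eps=0.2, beta=0.04):
    """Average over steps within each response, then over responses.

    new, old, ref: one list of per-token log-probabilities per response,
    under the current, previous, and reference model respectively.
    """
    advantages = group_advantages(rewards)
    per_response = []
    for lp_new, lp_old, lp_ref, adv in zip(new, old, ref, advantages):
        steps = [
            clipped_term(n, o, adv, eps) - beta * kl_penalty(n, r)
            for n, o, r in zip(lp_new, lp_old, lp_ref)
        ]
        per_response.append(mean(steps))
    return mean(per_response)


if __name__ == "__main__":
    # Hypothetical group of four responses, two tokens each.
    old = [[-1.0, -2.0], [-1.5, -0.5], [-0.7, -1.2], [-2.0, -1.0]]
    new = [[-0.8, -1.9], [-1.6, -0.7], [-0.6, -1.0], [-2.2, -1.1]]
    rewards = [1.0, 0.0, 1.0, 0.0]
    print(grpo_score(new, old, old, rewards))
```

In training, this score would be maximized with respect to the new model's parameters. The sketch only evaluates it for fixed log-probabilities, so it shows the arithmetic and not an optimizer.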