Overview
Explore Group Relative Policy Optimization (GRPO), the reinforcement learning technique DeepSeek used to train its reasoning model, in this 22-minute educational video from Serrano.Academy. Learn how GRPO differs from self-supervised learning by using reinforcement learning to let the model improve itself. Compare DeepSeek's reasoning with ChatGPT's through practical examples, see how the GRPO score is computed, and learn the key concepts behind it: answering with context, the quality advantage, response probabilities, and response clipping. The video is part of a broader series on reinforcement learning for large language models and explains these technical concepts in accessible terms. It suits viewers interested in the technical foundations of AI reasoning systems.
Syllabus
00:00 Introduction
00:26 Answering with context
01:40 DeepSeek vs ChatGPT
05:30 The GRPO score
07:05 Averaging over answers and steps
07:38 Quality Advantage
10:30 Probability of responses
15:36 Clipping the response
18:21 Not changing the model too much
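For readers who want to see how the syllabus topics fit together, here is a minimal Python sketch of a GRPO-style score. It is not taken from the video. It assumes one log-probability per whole response (the video also averages over generation steps), and the function names, the clipping width `eps=0.2`, and the penalty weight `beta=0.04` are illustrative assumptions rather than DeepSeek's settings.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Quality advantage: each reward relative to its group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored the same, so none is better than the others.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def clipped_term(ratio, advantage, eps=0.2):
    """Clipping: limit how much one response can move the score."""
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_penalty(logp_new, logp_ref):
    """Non-negative estimate of drift from a reference model (0 when equal)."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1


def grpo_score(rewards, logp_new, logp_old, logp_ref, eps=0.2, beta=0.04):
    """Average clipped, advantage-weighted ratio minus the drift penalty."""
    advantages = group_advantages(rewards)
    terms = []
    for adv, ln, lo, lr in zip(advantages, logp_new, logp_old, logp_ref):
        ratio = math.exp(ln - lo)  # probability of response: new vs. old model
        terms.append(clipped_term(ratio, adv, eps) - beta * kl_penalty(ln, lr))
    return mean(terms)
```

With rewards `[1.0, 0.0, 1.0, 0.0]`, the two correct responses receive a positive advantage and the two incorrect ones a negative advantage. A training step would then raise this score by making the better responses more likely, while clipping and the penalty keep the model from changing too much.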
Taught by
Serrano.Academy