Overview
Explore how DeepSeek R1 implements reinforcement learning in this 35-minute video, which breaks down the Group Relative Policy Optimization (GRPO) approach. Learn about the evolution from PPO to GRPO, how GRPO reduces memory usage, and why group relative advantages matter in model training. The video also covers KL-divergence, the reward signals, and a practical application: training a Rust reasoner. Chapter by chapter, it shows how GRPO improves on PPO while addressing the memory demands of large-scale model training.
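To give a feel for the "group relative advantages" idea named above, here is a minimal sketch, not taken from the video. It assumes the commonly described GRPO formulation, in which several completions are sampled for the same prompt and each reward is normalized against the mean and standard deviation of its own group, instead of against a learned value model as in PPO. The function name, the `eps` parameter, and the sample rewards are illustrative assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. `eps` is an assumed small constant that avoids
    division by zero when every reward in the group is identical.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance of the group's rewards.
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above their group's average get a positive advantage and those below get a negative one, so no separate value network has to be kept in memory to provide a baseline.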
Syllabus
0:00 Intro
0:52 Recap of R1
2:35 Why is GRPO Important
3:41 From PPO to GRPO
7:31 Reducing Memory Usage with GRPO
12:23 Group Relative Advantages
20:41 KL-Divergence
27:53 The Reward Signals
31:09 Training a Rust Reasoner
Taught by
Oxen