How DeepSeek R1's Reinforcement Learning Works Through GRPO

Oxen via YouTube

Overview

This 35-minute technical video breaks down how reinforcement learning in DeepSeek R1 works through Group Relative Policy Optimization (GRPO). It traces the evolution from PPO to GRPO, explains how GRPO reduces memory usage when training large models, and shows why group relative advantages matter. Chapter by chapter, it also covers KL-divergence, the reward signals, and a practical application: training a Rust reasoner.
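As a rough illustration of the "group relative advantages" idea named above, the sketch below scores each sampled completion relative to the other completions for the same prompt: subtract the group's mean reward and divide by its standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for this example; the video's own implementation may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per completion sampled for the same
    prompt. `eps` (an assumption of this sketch) avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt: two rewarded, two not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so no separately trained value model is needed to supply a baseline.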

Syllabus

0:00 Intro
0:52 Recap of R1
2:35 Why is GRPO Important
3:41 From PPO to GRPO
7:31 Reducing Memory Usage with GRPO
12:23 Group Relative Advantages
20:41 KL-Divergence
27:53 The Reward Signals
31:09 Training a Rust Reasoner

Taught by

Oxen
