
freeCodeCamp

DeepSeek R1 Theory Tutorial - Architecture, GRPO, KL Divergence

via freeCodeCamp

Overview

This tutorial video explains the architecture of DeepSeek R1 and how reinforcement learning gives the model its reasoning capabilities. It covers the Group Relative Policy Optimization (GRPO) methodology and how it improves upon traditional PPO approaches, and the role of KL divergence in keeping training stable, with code demonstrations and mathematical explanations throughout the 68-minute session. The content follows the complete R1 development pathway, from the R1-Zero reinforcement learning setup and cold-start supervised fine-tuning through reinforcement learning with a neural reward model to final distillation. It also covers the consistency reward for Chain-of-Thought reasoning, supervised fine-tuning data generation, and Monte Carlo estimators of KL divergence with benchmarking results.
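
As a rough illustration of the "group relative" idea in GRPO, the sketch below normalizes each sampled completion's reward against the other completions drawn for the same prompt, which is what lets GRPO avoid the separate value network that PPO uses. This is a minimal sketch and not the code shown in the video: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    The population std and the eps value are illustrative assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four completions of the same prompt
    # (1.0 = correct answer, 0.0 = incorrect answer).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that score above the group average get a positive advantage and those below get a negative one. When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.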

Syllabus

⌨️ 0:00:00 Introduction
⌨️ 0:01:49 R1 Overview - Overview
⌨️ 0:03:52 R1 Overview - DeepSeek R1-zero path
⌨️ 0:05:32 R1 Overview - Reinforcement learning setup
⌨️ 0:08:36 R1 Overview - Group Relative Policy Optimization GRPO
⌨️ 0:13:04 R1 Overview - DeepSeek R1-zero result
⌨️ 0:16:53 R1 Overview - Cold start supervised fine-tuning
⌨️ 0:17:44 R1 Overview - Consistency reward for CoT
⌨️ 0:18:35 R1 Overview - Supervised Fine tuning data generation
⌨️ 0:21:06 R1 Overview - Reinforcement learning with neural reward model
⌨️ 0:22:53 R1 Overview - Distillation
⌨️ 0:26:16 GRPO - Overview
⌨️ 0:26:55 GRPO - PPO vs GRPO
⌨️ 0:30:25 GRPO - PPO formula overview
⌨️ 0:33:25 GRPO - GRPO formula overview
⌨️ 0:36:48 GRPO - GRPO pseudo code
⌨️ 0:38:56 GRPO - GRPO Trainer code
⌨️ 0:49:24 KL Divergence - Overview
⌨️ 0:49:55 KL Divergence - KL Divergence in GRPO vs PPO
⌨️ 0:51:20 KL Divergence - KL Divergence refresher
⌨️ 0:55:32 KL Divergence - Monte Carlo estimation of KL divergence
⌨️ 0:56:43 KL Divergence - Schulman blog
⌨️ 0:57:38 KL Divergence - k1 = log(q/p)
⌨️ 1:00:01 KL Divergence - k2 = 0.5*(log(p/q))^2
⌨️ 1:02:19 KL Divergence - k3 = p/q - 1 - log(p/q)
⌨️ 1:04:44 KL Divergence - benchmarking
⌨️ 1:07:28 Conclusion
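
The three KL estimators named in the syllabus can be compared with a small Monte Carlo experiment. The sketch below assumes samples x are drawn from q and estimates KL(q || p), writing the estimators as k1 = log(q/p), k2 = 0.5*(log(p/q))^2, and k3 = p/q - 1 - log(p/q). The two unit-variance Gaussians, their means, the sample count, and the helper names are illustrative assumptions, not the benchmark used in the video.

```python
import math
import random


def normal_logpdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def kl_estimators(logp, logq):
    """Per-sample estimates of KL(q || p) for one sample x drawn from q."""
    log_r = logp - logq                  # log(p/q)
    k1 = -log_r                          # log(q/p): unbiased, can be negative
    k2 = 0.5 * log_r ** 2                # biased, never negative
    k3 = math.exp(log_r) - 1.0 - log_r   # unbiased, never negative
    return k1, k2, k3


def estimate_kl(n_samples=100_000, seed=0):
    """Average each estimator over samples from q and return the exact KL too."""
    rng = random.Random(seed)
    q_mean, p_mean, std = 0.0, 0.1, 1.0  # assumed toy distributions
    sums = [0.0, 0.0, 0.0]
    for _ in range(n_samples):
        x = rng.gauss(q_mean, std)
        ks = kl_estimators(normal_logpdf(x, p_mean, std),
                           normal_logpdf(x, q_mean, std))
        for i, k in enumerate(ks):
            sums[i] += k
    # Closed form for two Gaussians that share the same std.
    true_kl = (q_mean - p_mean) ** 2 / (2 * std ** 2)
    return true_kl, [s / n_samples for s in sums]


if __name__ == "__main__":
    true_kl, (k1, k2, k3) = estimate_kl()
    print(f"true={true_kl:.5f} k1={k1:.5f} k2={k2:.5f} k3={k3:.5f}")
```

Individual k1 samples can be negative even though the true KL divergence never is, which is one reason a low-variance, non-negative estimator such as k3 is attractive as a per-token penalty.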

Taught by

freeCodeCamp.org
