This 68-minute tutorial video walks through the architecture and training of DeepSeek R1 and explains how reinforcement learning gives the model its reasoning capabilities. It introduces Group Relative Policy Optimization (GRPO) and compares it with traditional PPO, then examines the role of KL divergence in keeping training stable, with code demonstrations and mathematical explanations throughout. The content covers the complete R1 development pathway: the R1-zero path, cold-start supervised fine-tuning, reinforcement learning with a neural reward model, and final distillation. It also covers the consistency reward for Chain-of-Thought reasoning, data generation for supervised fine-tuning, the PPO and GRPO formulas, and a benchmark of three Monte Carlo estimators of KL divergence.

Syllabus

⌨️ 0:00:00 Introduction
⌨️ 0:01:49 R1 Overview - Overview
⌨️ 0:03:52 R1 Overview - DeepSeek R1-zero path
⌨️ 0:05:32 R1 Overview - Reinforcement learning setup
⌨️ 0:08:36 R1 Overview - Group Relative Policy Optimization GRPO
⌨️ 0:13:04 R1 Overview - DeepSeek R1-zero result
⌨️ 0:16:53 R1 Overview - Cold start supervised fine-tuning
⌨️ 0:17:44 R1 Overview - Consistency reward for CoT
⌨️ 0:18:35 R1 Overview - Supervised Fine tuning data generation
⌨️ 0:21:06 R1 Overview - Reinforcement learning with neural reward model
⌨️ 0:22:53 R1 Overview - Distillation
⌨️ 0:26:16 GRPO - Overview
⌨️ 0:26:55 GRPO - PPO vs GRPO
⌨️ 0:30:25 GRPO - PPO formula overview
⌨️ 0:33:25 GRPO - GRPO formula overview
⌨️ 0:36:48 GRPO - GRPO pseudo code
⌨️ 0:38:56 GRPO - GRPO Trainer code
⌨️ 0:49:24 KL Divergence - Overview
⌨️ 0:49:55 KL Divergence - KL Divergence in GRPO vs PPO
⌨️ 0:51:20 KL Divergence - KL Divergence refresher
⌨️ 0:55:32 KL Divergence - Monte Carlo estimation of KL divergence
⌨️ 0:56:43 KL Divergence - Schulman blog
⌨️ 0:57:38 KL Divergence - k1 = log(q/p)
⌨️ 1:00:01 KL Divergence - k2 = 0.5*(log(p/q))^2
⌨️ 1:02:19 KL Divergence - k3 = (p/q - 1) - log(p/q)
⌨️ 1:04:44 KL Divergence - benchmarking
⌨️ 1:07:28 Conclusion
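
For readers who want to try the three KL estimators named in the syllabus before watching, the following is a minimal sketch and not the code shown in the video. It assumes the convention from the Schulman blog post: samples x are drawn from q, the ratio is r = p(x)/q(x), and the quantity being estimated is KL(q || p). The two Gaussian distributions, the function names, and the sample count are illustrative assumptions chosen so that the true KL has a closed form to compare against.

```python
import math
import random


def log_normal_pdf(x, mean, std=1.0):
    """Log density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def kl_terms(log_r):
    """Per-sample k1, k2, k3 given log_r = log p(x) - log q(x), with x ~ q."""
    k1 = -log_r                        # log(q/p): unbiased, can be negative
    k2 = 0.5 * log_r ** 2              # 0.5 * (log(p/q))^2: biased, never negative
    k3 = math.expm1(log_r) - log_r     # (p/q - 1) - log(p/q): unbiased, never negative
    return k1, k2, k3


def kl_estimates(num_samples=100_000, mean_p=0.1, seed=0):
    """Monte Carlo estimates of KL(q || p) for q = N(0, 1), p = N(mean_p, 1).

    These distributions are an assumption made for illustration only.
    """
    rng = random.Random(seed)
    sums = [0.0, 0.0, 0.0]
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)  # sample from q
        log_r = log_normal_pdf(x, mean_p) - log_normal_pdf(x, 0.0)
        for i, term in enumerate(kl_terms(log_r)):
            sums[i] += term
    n = float(num_samples)
    return {
        "k1": sums[0] / n,
        "k2": sums[1] / n,
        "k3": sums[2] / n,
        "true": 0.5 * mean_p ** 2,  # closed form for two unit-variance Gaussians
    }


if __name__ == "__main__":
    for name, value in kl_estimates().items():
        print(f"{name}: {value:.6f}")
```

In this setup all three averages should land close to the closed-form value. The practical difference is in the per-sample behaviour: k1 terms can be negative and are noisier, while k3 terms are always non-negative, which is one reason k3 is a common choice for a per-token KL penalty.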