
DeepSeek R1 Theory Overview - From GRPO to Reinforcement Learning and Supervised Fine-Tuning

Yacine Mahdid via YouTube

Overview

Learn about the training methodology behind DeepSeek R1 in this tutorial video, which breaks the dense paper down into digestible segments. Explore the complete training pipeline, starting with the R1-zero path and progressing through the reinforcement learning setup, Group Relative Policy Optimization (GRPO), supervised fine-tuning, and neural reward models. Gain insights into cold-start supervised fine-tuning, consistency rewards for Chain of Thought (CoT), data generation for fine-tuning, and the final distillation phase. A visualization map accompanies each stage of the pipeline, and references to additional resources are provided for a deeper understanding of specific concepts like GRPO.
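To give a flavor of the GRPO idea covered in the video: instead of training a separate value critic, GRPO samples a group of responses per prompt and normalizes each response's reward against the group's mean and standard deviation to obtain advantages. The sketch below is a minimal, hypothetical illustration of that normalization step only (function name and example rewards are invented for illustration), not the full GRPO objective.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Assumption: each of the sampled responses to one prompt has already been
# scored by a reward function (e.g. a rule-based correctness check).
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize each reward against the group's mean and std deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # guard against zero std when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers to the same prompt, two scored correct (1.0)
# and two incorrect (0.0); correct answers receive positive advantages.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are centered within each group, responses that beat their siblings are reinforced and the rest are suppressed, with no learned baseline needed.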

Syllabus

- Introduction: 0:00
- DeepSeek R1-zero path: 2:23
- Reinforcement learning setup: 3:59
- Group Relative Policy Optimization (GRPO): 7:03
- DeepSeek R1-zero result: 11:40
- Cold start supervised fine-tuning: 15:30
- Consistency reward for CoT: 16:19
- Supervised fine-tuning data generation: 17:17
- Reinforcement learning with neural reward model: 19:47
- Distillation: 21:26
- Conclusion: 24:34

Taught by

Yacine Mahdid
