DeepSeek R1 Theory Tutorial - Architecture, GRPO, KL Divergence

freeCodeCamp.org via freeCodeCamp

⌨️ 0:51:20 KL Divergence - KL Divergence refresher

20 of 27

Class Central Classrooms beta

YouTube videos curated by Class Central.

Classroom Contents

  1. ⌨️ 0:00:00 Introduction
  2. ⌨️ 0:01:49 R1 Overview - Overview
  3. ⌨️ 0:03:52 R1 Overview - DeepSeek R1-zero path
  4. ⌨️ 0:05:32 R1 Overview - Reinforcement learning setup
  5. ⌨️ 0:08:36 R1 Overview - Group Relative Policy Optimization (GRPO)
  6. ⌨️ 0:13:04 R1 Overview - DeepSeek R1-zero result
  7. ⌨️ 0:16:53 R1 Overview - Cold start supervised fine-tuning
  8. ⌨️ 0:17:44 R1 Overview - Consistency reward for CoT
  9. ⌨️ 0:18:35 R1 Overview - Supervised fine-tuning data generation
  10. ⌨️ 0:21:06 R1 Overview - Reinforcement learning with neural reward model
  11. ⌨️ 0:22:53 R1 Overview - Distillation
  12. ⌨️ 0:26:16 GRPO - Overview
  13. ⌨️ 0:26:55 GRPO - PPO vs GRPO
  14. ⌨️ 0:30:25 GRPO - PPO formula overview
  15. ⌨️ 0:33:25 GRPO - GRPO formula overview
  16. ⌨️ 0:36:48 GRPO - GRPO pseudo code
  17. ⌨️ 0:38:56 GRPO - GRPO Trainer code
  18. ⌨️ 0:49:24 KL Divergence - Overview
  19. ⌨️ 0:49:55 KL Divergence - KL Divergence in GRPO vs PPO
  20. ⌨️ 0:51:20 KL Divergence - KL Divergence refresher
  21. ⌨️ 0:55:32 KL Divergence - Monte Carlo estimation of KL divergence
  22. ⌨️ 0:56:43 KL Divergence - Schulman blog
  23. ⌨️ 0:57:38 KL Divergence - k1 = log(q/p)
  24. ⌨️ 1:00:01 KL Divergence - k2 = 0.5*(log(p/q))^2
  25. ⌨️ 1:02:19 KL Divergence - k3 = p/q - 1 - log(p/q)
  26. ⌨️ 1:04:44 KL Divergence - benchmarking
  27. ⌨️ 1:07:28 Conclusion
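
For readers who want to try the three estimators named in chapters 23 to 25 before watching, here is a minimal sketch. It assumes the convention the chapter titles suggest: samples x are drawn from q, the ratio is p(x)/q(x), and the quantity being estimated is KL(q || p). The two unit-variance Gaussians, the sample count, and the names `log_normal_pdf` and `kl_estimates` are illustrative choices made here, not taken from the video.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def kl_estimates(n_samples=100_000, seed=0):
    """Monte Carlo estimates of KL(q || p) using k1, k2 and k3.

    Assumption for illustration: q = N(0, 1) and p = N(0.1, 1).
    """
    rng = random.Random(seed)
    q_mean, p_mean, std = 0.0, 0.1, 1.0
    k1 = k2 = k3 = 0.0
    for _ in range(n_samples):
        x = rng.gauss(q_mean, std)  # sample x ~ q
        # log of the ratio p(x) / q(x)
        log_r = log_normal_pdf(x, p_mean, std) - log_normal_pdf(x, q_mean, std)
        k1 += -log_r                     # k1 = log(q/p)
        k2 += 0.5 * log_r ** 2           # k2 = 0.5 * (log(p/q))^2
        k3 += math.expm1(log_r) - log_r  # k3 = p/q - 1 - log(p/q)
    # Closed form for two Gaussians with equal variance.
    true_kl = (q_mean - p_mean) ** 2 / (2 * std ** 2)
    return {
        "true": true_kl,
        "k1": k1 / n_samples,
        "k2": k2 / n_samples,
        "k3": k3 / n_samples,
    }


if __name__ == "__main__":
    for name, value in kl_estimates().items():
        print(f"{name}: {value:.6f}")
```

Individual k1 samples can be negative, while every k2 and k3 sample is non-negative, so on a setup like this one k1 would be expected to show the largest spread around the closed-form value.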
