GRPO - Group Relative Policy Optimization - How DeepSeek trains reasoning models
Class Central Classrooms
YouTube videos curated by Class Central.
Classroom Contents
Reinforcement Learning for Large Language Models - RLHF, PPO, DPO, and GRPO
1. Reinforcement Learning with Human Feedback (RLHF) - How to train and fine-tune Transformer Models
2. Proximal Policy Optimization (PPO) - How to train Large Language Models
3. Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning
4. GRPO - Group Relative Policy Optimization - How DeepSeek trains reasoning models
5. KL Divergence - How to tell how different two distributions are
6. A friendly introduction to deep reinforcement learning, Q-networks and policy gradients