Overview
Explore the mathematical foundations of Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) in this 25-minute video tutorial. These reinforcement learning methods have gained prominence through their use in post-training Large Language Models for alignment with preference data. Begin with an intuitive statement of the problem and the initial objective, then work through the analytical derivation of both algorithms.

Along the way, master key concepts including return functions, value functions, and importance sampling, and see how these techniques evolved from Trust Region Policy Optimization (TRPO). The video follows the complete mathematical derivation from first principles to the final objectives, with clear explanations of each step. Extensive supplementary resources are provided, including the TRPO, PPO, GRPO, and REINFORCE papers, along with additional material on the log-derivative trick, reinforcement learning fundamentals, and importance sampling, to deepen your understanding of these optimization techniques used in modern AI systems.
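As a quick reference before watching, here is a minimal sketch of the three formulas the derivation builds toward, written in the notation of the original PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024) papers; the video's own symbols and normalization choices may differ.

% Importance sampling: estimate an expectation under distribution p
% using samples drawn from a different distribution q.
\mathbb{E}_{x \sim p}[f(x)] \;=\; \mathbb{E}_{x \sim q}\!\left[ \frac{p(x)}{q(x)}\, f(x) \right]

% PPO's clipped surrogate objective, where r_t is the probability ratio
% between the current policy and the policy that collected the data:
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right]

% GRPO keeps the clipped ratio but drops the learned value function,
% computing a group-relative advantage from the rewards r_1, ..., r_G of
% G completions sampled for the same prompt:
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}

The group-relative baseline is what lets GRPO avoid training a separate critic network, one of its main practical advantages over PPO in LLM post-training.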
Syllabus
00:00 Introduction
01:17 Problem Statement
03:17 Intuitive Objective
04:07 Analytically Computable Objective
10:11 Return Function
12:07 Value Function
14:53 Importance Sampling
17:40 TRPO
19:16 PPO
21:15 GRPO
23:45 Summary
24:31 Outro
Taught by
Outlier