Training Small Language Models to Reason with Reinforcement Learning - GRPO from Scratch

Neural Breakdown with AVB via YouTube

Overview

Learn to implement Group Relative Policy Optimization (GRPO) from scratch in PyTorch and use it to train Small Language Models (SLMs) on reasoning tasks with reinforcement learning. The video covers the theory behind the policy gradient equations and the PPO loss function while building a complete training pipeline.

Set up a reinforcement learning environment with the Reasoning Gym library and see how rewards are assigned for mathematical reasoning tasks. Code the response generation, reward calculation, and advantage estimation steps used for policy optimization. Implement the log probability calculations, then assemble the full RL training loop and visualize how the model's log probabilities change after training. Work through the GRPO and PPO loss functions, including the surrogate clipping mechanism that keeps training stable.

Apply supervised fine-tuning with LoRA (Low-Rank Adaptation) for parameter-efficient updates. Review the results from the reasoning-capable SLMs and go through 10 practical tips for fine-tuning reasoning models. The referenced papers include DeepSeekMath, DeepSeek-R1, DAPO, and critical perspectives on reasoning model development.
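As a rough illustration of the advantage estimation step mentioned above, the sketch below normalizes rewards within one group of responses sampled for the same prompt, which is the "group relative" idea in GRPO. It is not the video's code: the function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions made for this example, and it uses plain Python rather than PyTorch so it runs without extra dependencies.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per response, normalized within the group.

    `rewards` holds the scalar reward of each response sampled for the
    same prompt. `eps` (an assumed value) avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    # Population standard deviation is an assumption here; some
    # implementations use the sample standard deviation instead.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer got the same reward carries no signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that score above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.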

Syllabus

0:00 - Thinking LLMs are taking over!
3:47 - Setting up Reinforcement Learning Environment
4:50 - Reasoning Gym library - Rewards
8:00 - GRPO Visually explained
10:41 - Policy Optimization and PPO loss Explained
15:45 - Coding response generation
20:55 - Coding Reward Generation & Advantages
26:25 - Calculating log probabilities
30:58 - RL Training loop
33:49 - Visualizing log probabilities post training
36:01 - The GRPO and PPO Loss function
38:19 - Surrogate clipping
41:21 - Supervised Fine-tuning and LoRA training
43:26 - Reasoning SLM results!
45:36 - 10 Practical Tips for fine-tuning Reasoning SLMs
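
The surrogate clipping chapter listed above refers to the clipped objective that PPO introduced and GRPO reuses. A minimal sketch follows, assuming per-token log probabilities under the current and old policies plus one advantage per token. The function name `clipped_surrogate_loss` and the default `clip_eps=0.2` are illustrative choices, not taken from the video, and the KL penalty term that GRPO formulations often add is left out for brevity.

```python
import math


def clipped_surrogate_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss over a flat list of tokens.

    All three inputs are equal-length lists of floats. The loss is the
    negative of the clipped objective, so minimizing it maximizes the
    objective.
    """
    terms = []
    for lp_new, lp_old, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum makes the objective a pessimistic bound.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return -sum(terms) / len(terms)


# Unchanged policy: ratios are 1, so opposite advantages cancel out.
loss_same = clipped_surrogate_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0])

# The token became twice as likely and the advantage is positive:
# the gain is capped at (1 + clip_eps) * advantage.
loss_capped = clipped_surrogate_loss([math.log(2.0)], [0.0], [1.0])

# Same ratio with a negative advantage: the unclipped term is smaller,
# so the full penalty passes through.
loss_penalized = clipped_surrogate_loss([math.log(2.0)], [0.0], [-1.0])
```

The asymmetry in the last two calls is the point of the clipping: the policy cannot collect extra credit for moving far from the old policy, but it is still fully penalized when such a move makes a bad response more likely.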

Taught by

Neural Breakdown with AVB
