Reinforcement Learning with Verifiable Rewards - RLVR Environments for LLMs
Yacine Mahdid via YouTube
Syllabus
00:00 - Introduction: RL’s growing role in agentic AI
01:10 - The RLVR loop: dataset, policy, rollouts, rewards, updates
02:13 - Overview of the state of RLVR
03:50 - Small-model RLVR: performance, latency, and cost benefits
06:00 - RLVR vs RLHF: key conceptual differences
07:32 - Open-source frameworks: ReasoningGym, ART, TRL, and Verifiers
08:12 - Deep dive into Verifiers: the 7 steps, with a math-python environment
08:25 - Deep dive into Verifiers | Step 1: Data
09:09 - Deep dive into Verifiers | Step 2: Interaction style
09:40 - Deep dive into Verifiers | Step 3: Environment logic
10:05 - Deep dive into Verifiers | Step 4: Reward function rubric
11:23 - Deep dive into Verifiers | Step 5: Parser (optional)
11:46 - Deep dive into Verifiers | Step 6: Package the environment
12:07 - Deep dive into Verifiers | Step 7: Run evaluation or training
12:30 - A few community environments
13:25 - Case study: Building a Vision-Language RLVR environment, featuring Alexine
13:56 - Vision-SR1: overview
16:46 - Vision-SR1: environment 1
18:29 - Vision-SR1: environment 2
20:03 - Interview with Will Brown (Prime Intellect), creator of Verifiers
20:18 - Interview with Will Brown - Verifiers development story
23:16 - Interview with Will Brown - what's the vision for the Environments Hub?
24:17 - Interview with Will Brown - what future is there for RL environments?
26:27 -
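The RLVR loop covered at 01:10 (dataset → policy rollouts → verifiable rewards → policy updates) can be sketched in miniature. This is an illustrative toy, not code from the video: the `policy` stub stands in for an LLM, and the "update" step is left to a trainer such as a GRPO/PPO implementation.

```python
import random

# Step: dataset of prompts with checkable reference answers.
dataset = [
    {"prompt": "2 + 3", "answer": "5"},
    {"prompt": "7 - 4", "answer": "3"},
]

def policy(prompt, temperature=1.0):
    """Stub standing in for an LLM: emits a candidate completion.
    It computes the true answer but is deliberately noisy, so rollouts
    receive a mix of rewards, as a real model's samples would."""
    a, op, b = prompt.split()
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y}
    result = ops[op](int(a), int(b))
    if random.random() < 0.3 * temperature:  # sometimes wrong on purpose
        result += 1
    return str(result)

def verifiable_reward(completion, answer):
    """The 'verifiable' part: a deterministic check against the reference,
    no learned reward model involved."""
    return 1.0 if completion.strip() == answer else 0.0

def rlvr_step(dataset, rollouts_per_prompt=4):
    """One loop iteration: sample rollouts per prompt, score each with the
    verifiable reward, and return (prompt, completion, reward) triples for
    a policy-update step (e.g. GRPO) to consume."""
    scored = []
    for example in dataset:
        for _ in range(rollouts_per_prompt):
            completion = policy(example["prompt"])
            reward = verifiable_reward(completion, example["answer"])
            scored.append((example["prompt"], completion, reward))
    return scored

batch = rlvr_step(dataset)  # 2 prompts x 4 rollouts = 8 scored samples
```

Sampling several rollouts per prompt matters: group-relative methods like GRPO compute advantages from the spread of rewards within each prompt's group.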
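The seven environment-building steps walked through between 08:12 and 12:07 can be sketched generically. All class and function names below are illustrative assumptions, not the actual Verifiers API; the point is the shape: data, single-turn interaction, environment logic, a weighted reward rubric, an optional parser, and an evaluation entry point (packaging, step 6, is just ordinary Python packaging and is omitted).

```python
import re

def parse_answer(completion):
    """Step 5 (parser, optional): extract the final answer from an
    <answer>...</answer> tag if the model produced one."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else None

def correct_answer_reward(completion, answer):
    """Primary rubric component: exact match on the parsed answer."""
    return 1.0 if parse_answer(completion) == answer else 0.0

def format_reward(completion, answer):
    """Auxiliary rubric component: did the model follow the format at all?"""
    return 1.0 if parse_answer(completion) is not None else 0.0

class MathEnv:
    """Hypothetical single-turn math environment combining steps 1-4 and 7:
    data, interaction style (one prompt, one completion), environment logic,
    a weighted reward rubric, and an evaluate() entry point."""

    def __init__(self, dataset, reward_funcs, weights):
        self.dataset = dataset            # step 1: data
        self.reward_funcs = reward_funcs  # step 4: rubric components
        self.weights = weights            # step 4: rubric weights

    def score(self, completion, answer):
        """Weighted sum over the rubric's reward functions."""
        return sum(w * f(completion, answer)
                   for f, w in zip(self.reward_funcs, self.weights))

    def evaluate(self, model_fn):
        """Step 7: run a model over the dataset, average the rubric score."""
        scores = [self.score(model_fn(ex["prompt"]), ex["answer"])
                  for ex in self.dataset]
        return sum(scores) / len(scores)

env = MathEnv(
    dataset=[{"prompt": "What is 2 + 3?", "answer": "5"}],
    reward_funcs=[correct_answer_reward, format_reward],
    weights=[0.8, 0.2],
)
score = env.evaluate(lambda prompt: "<answer>5</answer>")
```

Splitting the rubric into a correctness term and a cheap format term is a common pattern: the format reward gives early training signal before the model reliably produces correct answers.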
Taught by
Yacine Mahdid