Length Controlled Policy Optimization for Scaling Reinforcement Learning - CMU Research
Discover AI via YouTube
Overview
Explore Length Controlled Policy Optimization (LCPO) in this research video covering work from Carnegie Mellon University. LCPO is a simple reinforcement learning method that optimizes for both accuracy and adherence to a user-specified length constraint. Discover how CMU researchers applied LCPO to train L1, a reasoning language model whose outputs satisfy the length constraint given in the prompt. Understand how LCPO builds upon Group Relative Policy Optimization (GRPO), the reinforcement learning method introduced in DeepSeekMath and used to train DeepSeek-R1. The presentation covers the research paper "L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning" by Pranjal Aggarwal and Sean Welleck of Carnegie Mellon University, offering valuable insights for those interested in AI research, AI agents, and AI policy.
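To make the idea of rewarding both accuracy and length adherence concrete, here is a minimal Python sketch. It is not taken from this overview: the reward form (a correctness term minus a penalty proportional to the gap between target and actual length) is an assumption about how such an objective can be written, the function names and the `alpha` value are hypothetical, and the group-relative normalization only illustrates the general GRPO idea of scoring each sampled response against the others in its group.

```python
from statistics import mean, pstdev


def length_controlled_reward(is_correct, target_len, actual_len, alpha=0.001):
    """Correctness reward minus a penalty for missing the target length.

    alpha is a hypothetical trade-off weight, chosen here only for illustration.
    """
    return float(is_correct) - alpha * abs(target_len - actual_len)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for one prompt that asks for 1000 tokens:
# (answer is correct?, number of tokens generated)
samples = [(True, 1000), (True, 1500), (False, 1000), (False, 400)]
rewards = [length_controlled_reward(ok, 1000, n) for ok, n in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a correct answer that hits the requested length gets the highest advantage in its group, while a correct but overlong answer is ranked below it.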
Syllabus
NEW L1 LLM w/ GRPO to LCPO for Scaling RL (CMU)
Taught by
Discover AI