Length Controlled Policy Optimization for Scaling Reinforcement Learning - CMU Research
Discover AI via YouTube
Overview
This research video covers Length Controlled Policy Optimization (LCPO), a technique from Carnegie Mellon University. LCPO is a simple reinforcement learning method that optimizes for two objectives at once: answer accuracy and adherence to a user-specified length constraint. The CMU researchers used LCPO to train L1, a reasoning language model whose outputs satisfy the length constraint given in the prompt. LCPO builds on Group Relative Policy Optimization (GRPO), the method for scaling reinforcement learning that was introduced in the DeepSeekMath paper and later used to train DeepSeek-R1. The presentation walks through the research paper "L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning" by Pranjal Aggarwal and Sean Welleck of Carnegie Mellon University, and is aimed at viewers interested in AI research, AI agents, and AI policy.
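The overview does not spell out the reward itself, so the following is only a minimal sketch of how a single reward could combine accuracy with length adherence and then be turned into GRPO-style group-relative advantages. The function names, the penalty weight `alpha`, the linear penalty on the token-count gap, and the sample data are all illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def length_controlled_reward(is_correct, used_tokens, target_tokens, alpha=0.001):
    """Hypothetical reward: 1 for a correct answer, 0 otherwise, minus a
    penalty proportional to how far the output length is from the target.
    `alpha` is an assumed weight that trades accuracy against length adherence."""
    return float(is_correct) - alpha * abs(target_tokens - used_tokens)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization (sketch): score each sampled response
    relative to the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up group of four sampled responses to one prompt that asks for
# roughly 512 reasoning tokens: (answer correct?, tokens used).
samples = [(True, 512), (True, 900), (False, 500), (False, 1400)]
target = 512

rewards = [length_controlled_reward(ok, n, target) for ok, n in samples]
advantages = group_relative_advantages(rewards)

for (ok, n), r, a in zip(samples, rewards, advantages):
    print(f"correct={ok!s:5} tokens={n:5d} reward={r:+.3f} advantage={a:+.3f}")
```

In this sketch the response that is both correct and on target receives the largest advantage, while a correct but overlong response is ranked below it, which is the behavior the overview attributes to LCPO.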
Syllabus
NEW L1 LLM w/ GRPO to LCPO for Scaling RL (CMU)
Taught by
Discover AI