Length Controlled Policy Optimization for Scaling Reinforcement Learning - CMU Research
Discover AI via YouTube
Overview
Explore the Length Controlled Policy Optimization (LCPO) technique in this research video from Carnegie Mellon University. Learn about this simple reinforcement learning method, which optimizes for both accuracy and adherence to user-specified length constraints. Discover how CMU researchers applied LCPO to train L1, a reasoning language model that produces outputs satisfying length constraints given in its prompt. Understand how LCPO builds on Group Relative Policy Optimization (GRPO), a reinforcement learning method introduced in DeepSeekMath and later used to scale RL in DeepSeek-R1. This presentation covers the research paper "L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning" by Pranjal Aggarwal and Sean Welleck of Carnegie Mellon University, offering valuable insights for those interested in AI research, AI agents, and AI policy.
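The two ingredients described above can be sketched in a few lines: a reward that combines correctness with a penalty on deviation from the prompted length budget, and GRPO-style group-relative advantages computed from sampled rewards instead of a learned critic. This is a minimal illustrative sketch, not the paper's implementation; the function names and the `alpha` value are assumptions made here for clarity.

```python
from statistics import mean, pstdev

def lcpo_reward(is_correct: bool, target_length: int, actual_length: int,
                alpha: float = 0.0003) -> float:
    """LCPO-style scalar reward (illustrative): correctness score minus
    a penalty proportional to how far the response length strays from
    the length budget stated in the prompt. `alpha` trades off accuracy
    against length adherence (value here is an assumption)."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(target_length - actual_length)
    return correctness - length_penalty

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantages: normalize each sampled
    response's reward by the mean and std of its sampling group,
    avoiding a separate value/critic network."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Example: a group of 4 responses sampled for one prompt with a
# 1000-token budget; responses that are correct and close to the
# budget get the highest advantage.
rewards = [
    lcpo_reward(True, 1000, 1020),   # correct, near budget
    lcpo_reward(True, 1000, 2500),   # correct, far over budget
    lcpo_reward(False, 1000, 980),   # wrong, near budget
    lcpo_reward(False, 1000, 3000),  # wrong, far over budget
]
advantages = group_advantages(rewards)
```

In GRPO, these per-response advantages then weight the policy-gradient update for each sampled response's tokens; LCPO's contribution is conditioning the model on a length target in the prompt and folding length adherence into the reward.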
Syllabus
NEW L1 LLM w/ GRPO to LCPO for Scaling RL (CMU)
Taught by
Discover AI