Overview
Explore a 25-minute video examining the critical challenge of policy collapse in self-supervised reinforcement learning for large language models, and discover a novel solution through momentum-anchored policy optimization. Learn how the frontier of LLM research has shifted toward post-training and System 2 reasoning, where the goal is to replicate o1-level performance by moving beyond supervised fine-tuning and embracing reinforcement learning with verifiable rewards.

Understand why current self-supervised methods face a fundamental instability: as models train on their own pseudo-labels, they begin to "game" the reward signal, leading to overconfidence, entropy collapse, and degraded reasoning performance. Examine the mathematical evidence showing that the standard industry approach of scaling rollout samples only delays, but cannot prevent, this crash.

Discover the M-GRPO (momentum-anchored Group Relative Policy Optimization) framework, which fundamentally changes how models interact with their training history to bypass policy collapse entirely and achieve state-of-the-art performance where previous baselines failed. Gain insights into how this architectural approach enables truly self-supervised reinforcement learning, in which models generate their own questions, verify reasoning chains, and improve indefinitely without expensive human annotations, based on research from Shanghai Innovation Institute, Fudan University, Shanghai AI Laboratory, and The Chinese University of Hong Kong.
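The overview does not spell out the M-GRPO objective itself, so the following is only a minimal PyTorch sketch of the general idea it describes: group-relative advantages combined with a slow-moving "momentum anchor" copy of the policy that the learner is penalized for drifting away from. The EMA update rule, the KL penalty, and the coefficients `beta` and `kl_coef` are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style advantages: normalize each rollout's reward against the
    # mean and std of its group; rewards has shape [num_groups, group_size].
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std


@torch.no_grad()
def momentum_anchor_update(anchor: torch.nn.Module, policy: torch.nn.Module,
                           beta: float = 0.99) -> None:
    # Assumed EMA rule: the anchor drifts slowly toward the current policy,
    # retaining a memory of the policy's training history.
    for pa, pp in zip(anchor.parameters(), policy.parameters()):
        pa.mul_(beta).add_(pp, alpha=1.0 - beta)


def anchored_policy_loss(logits: torch.Tensor, anchor_logits: torch.Tensor,
                         actions: torch.Tensor, advantages: torch.Tensor,
                         kl_coef: float = 0.05) -> torch.Tensor:
    # Advantage-weighted log-likelihood (REINFORCE-style surrogate) ...
    logp = F.log_softmax(logits, dim=-1)
    chosen_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * chosen_logp).mean()

    # ... plus KL(policy || anchor): penalizes drifting too far from the
    # slow-moving anchor, limiting the overconfidence and entropy collapse
    # that arise when a model trains on its own pseudo-labels.
    anchor_logp = F.log_softmax(anchor_logits, dim=-1).detach()
    kl = (logp.exp() * (logp - anchor_logp)).sum(dim=-1).mean()
    return pg_loss + kl_coef * kl


# Toy usage: 2 groups of 4 rollouts, vocab of 10, one action step each.
rewards = torch.rand(2, 4)
adv = group_relative_advantages(rewards).reshape(-1)   # [8]
logits = torch.randn(8, 10, requires_grad=True)
anchor_logits = torch.randn(8, 10)
actions = torch.randint(0, 10, (8,))
loss = anchored_policy_loss(logits, anchor_logits, actions, adv)
loss.backward()
```

The intuition matches the failure mode the overview describes: a fast-updating policy can game a reward built from its own pseudo-labels, while an anchor tied to the policy's history keeps its output distribution from collapsing.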
Syllabus
Self Learning AI: Accelerate w/ new RL
Taught by
Discover AI