Overview
Explore a 25-minute video on policy collapse in self-supervised reinforcement learning for large language models and a proposed remedy, momentum-anchored policy optimization. Learn how the frontier of LLM research has shifted toward post-training and System 2 reasoning, where the goal is to replicate o1-level performance by moving beyond supervised fine-tuning to reinforcement learning with verifiable rewards. Understand why current self-supervised methods are unstable: as a model trains on its own pseudo-labels, it begins to "game" the reward signal, which leads to overconfidence, entropy collapse, and degraded reasoning performance. Examine the mathematical evidence presented that the common practice of scaling up rollout samples delays this collapse but does not prevent it. Discover the M-GRPO (Momentum-Anchored Policy Optimization) framework, which changes how a model interacts with its own training history in order to avoid policy collapse and is reported to reach state-of-the-art performance where previous baselines failed. Gain insight into how this approach could enable fully self-supervised reinforcement learning, in which models generate their own questions, verify their reasoning chains, and keep improving without expensive human annotations. The video is based on research from Shanghai Innovation Institute, Fudan University, Shanghai AI Laboratory, and The Chinese University of Hong Kong.
Syllabus
Self Learning AI: Accelerate w/ new RL
Taught by
Discover AI