Overview
This video explores Qwen's WorldPM (World Preference Modeling), an approach to fundamental challenges in Reinforcement Learning from Human Feedback (RLHF). It explains how WorldPM encodes human preferences at scale and how that could change the way AI systems are aligned with human values. The 21-minute presentation examines scaling laws that relate model size to preference-modeling effectiveness, building on Qwen's previous models. It also covers the technical foundations of Group Relative Policy Optimization (GRPO) and how it addresses RLHF's biggest limitations. The research is a collaboration between Fudan University and the Qwen Team at Alibaba Group, and the model and paper are publicly available on GitHub.
Syllabus
RLHF’s Missing Piece: Qwen’s World Model Aligns AI w/ Human Values (GRPO)
Taught by
Discover AI