
RLHF's Missing Piece: Qwen's World Model Aligns AI with Human Values - GRPO

Discover AI via YouTube

Overview

This video explores Qwen's WorldPM (World Preference Modeling), a new approach to fundamental challenges in Reinforcement Learning from Human Feedback (RLHF). Learn how this world model encodes human preferences at scale, potentially transforming how AI systems align with human values. The 21-minute presentation examines scaling laws relating preference-model size to effectiveness, building on Qwen's previous models. It also covers the technical foundations of Group Relative Policy Optimization (GRPO) and how it addresses key limitations of RLHF. The research is a collaboration between Fudan University and the Qwen Team at Alibaba Group, with the model and paper publicly available on GitHub.
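For context on the preference modeling the video discusses: reward/preference models in standard RLHF practice are typically trained with a pairwise Bradley-Terry objective. The minimal sketch below illustrates that general objective only; it is not code from the WorldPM paper, and the function name is our own.

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the rejected one.

    Standard pairwise objective for RLHF preference models:
        loss = -log(sigmoid(s_chosen - s_rejected))
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# A model that scores the preferred response higher incurs a small loss;
# one that ranks the pair the wrong way incurs a large loss.
confident_correct = bradley_terry_loss(2.0, -1.0)  # margin +3.0
confident_wrong = bradley_terry_loss(-1.0, 2.0)    # margin -3.0
```

Minimizing this loss over many human-labeled comparison pairs is what lets a preference model "encode human preferences at scale," which the RLHF policy step (e.g. GRPO) then optimizes against.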

Syllabus

RLHF’s Missing Piece: Qwen’s World Model Aligns AI w/ Human Values (GRPO)

Taught by

Discover AI

