Overview
Explore research from Princeton University and the University of Illinois that reveals critical flaws in the implicit reward models used for reinforcement learning alignment. Discover why Direct Preference Optimization (DPO) and similar implicit reward approaches generalize poorly, while the explicit reward models trained in standard Reinforcement Learning from Human Feedback (RLHF) pipelines generalize substantially better. Examine the performance gap between the two approaches and the underlying mechanisms that cause implicit reward models to struggle. Learn about the findings of researchers Noam Razin, Yong Lin, Jiarui Yao, and Sanjeev Arora as they investigate why language models serve as poor implicit reward models and what this means for the future of AI alignment strategies.
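To make the distinction concrete: in DPO, no separate reward network is trained. Instead, the reward of a response is defined implicitly from the policy's and reference model's likelihoods, r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)). The sketch below illustrates this standard DPO formula with toy log-probabilities; the numbers and function name are illustrative, not from the course or paper.

```python
import math

def dpo_implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO's implicit reward: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Inputs are the summed token log-probabilities of response y under the
    trained policy and the frozen reference model, so the log-ratio is just
    a difference of log-probs.
    """
    return beta * (logp_policy - logp_ref)

# Toy example: the policy assigns higher likelihood than the reference
# to the preferred response, and lower likelihood to the rejected one.
chosen_reward = dpo_implicit_reward(logp_policy=-12.0, logp_ref=-15.0)
rejected_reward = dpo_implicit_reward(logp_policy=-18.0, logp_ref=-14.0)
print(chosen_reward, rejected_reward)  # chosen > 0 > rejected
```

An explicit RLHF reward model, by contrast, is a separately trained network that scores (prompt, response) pairs directly; the research covered in this course compares how well these two kinds of reward signal generalize.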
Syllabus
AI FALLS: DPO RL crumbles (Princeton)
Taught by
Discover AI