Overview
Learn Direct Preference Optimization (DPO), a method for preference-tuning large language models that removes the need to train an explicit reward model by working directly from preference data. The tutorial presents the complete mathematical derivation, from the initial RLHF-style problem statement to the final DPO objective, and explains why this makes training more efficient than reward-model-based methods such as PPO and GRPO. It first sets up the problem, then walks through the derivation step by step, and closes by showing how removing reward-model training streamlines the overall pipeline.
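The final objective the derivation arrives at can be sketched as a small function: for each prompt with a chosen and a rejected response, DPO maximizes the log-sigmoid of the scaled difference between the policy's and the reference model's log-probability ratios. A minimal per-example sketch (the function name and the `beta` value are illustrative, not from the tutorial):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss.

    Args are log-probabilities of the chosen/rejected responses under
    the policy and under the frozen reference model. beta controls how
    far the policy may drift from the reference.
    """
    # Implicit rewards: log-ratio of policy to reference probabilities.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Loss = -log sigmoid(beta * (chosen margin - rejected margin)).
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference agree on both responses, the logits are zero and the loss is log 2; the loss shrinks as the policy raises the chosen response's probability relative to the rejected one.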
Syllabus
00:00 Introduction
01:02 Problem Statement
03:08 Derivation
16:21 Outro
Taught by
Outlier