Fine-tuning LLMs on Human Feedback (RLHF + DPO)

Shaw Talebi via YouTube

YouTube videos curated by Class Central.

Classroom Contents

  1. Intro - 0:00
  2. Base Models - 0:25
  3. InstructGPT - 2:20
  4. RL from Human Feedback (RLHF) - 5:18
  5. Proximal Policy Optimization (PPO) - 9:20
  6. Limitations of RLHF - 10:30
  7. Direct Preference Optimization (DPO) - 11:50
  8. Example: Fine-tuning Qwen on Title Preferences - 14:29
  9. Step 1: Curate preference data - 17:49
  10. Step 2: Fine-tune with DPO - 20:53 (see the sketch after this list)
  11. Step 3: Evaluate the fine-tuned model - 25:27
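
The last three chapters walk through a DPO fine-tuning workflow on title preferences. As a rough orientation only, here is a minimal sketch of what those steps typically look like using Hugging Face TRL's DPOTrainer; the model checkpoint name, example preference rows, and hyperparameters are assumptions for illustration, not taken from the video.

    # Minimal sketch of the three steps above, assuming a recent version of
    # Hugging Face TRL. Names and hyperparameters are illustrative, not the
    # video's actual code.
    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    # Step 1: curate preference data as (prompt, chosen, rejected) triples.
    preference_rows = [
        {
            "prompt": "Write a YouTube title for a video on fine-tuning LLMs with human feedback.",
            "chosen": "Fine-tuning LLMs on Human Feedback (RLHF + DPO)",
            "rejected": "My new video about language models",
        },
        # ... more preference pairs ...
    ]
    train_dataset = Dataset.from_list(preference_rows)

    # Step 2: fine-tune with DPO. The checkpoint name below is an assumption.
    model_name = "Qwen/Qwen2.5-0.5B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    args = DPOConfig(
        output_dir="qwen-dpo-titles",
        beta=0.1,  # how strongly the policy is kept close to the reference model
        per_device_train_batch_size=2,
        num_train_epochs=1,
        logging_steps=10,
    )
    trainer = DPOTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        processing_class=tokenizer,
    )
    trainer.train()

    # Step 3: rough evaluation -- generate a title with the fine-tuned model and
    # compare it against the base model's output (manually or with a judge model).
    prompt = "Write a YouTube title for a video on fine-tuning LLMs with human feedback."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Unlike RLHF, DPO needs no separate reward model or PPO loop: the preference pairs enter the loss directly, and beta controls how far the fine-tuned policy may drift from the reference model.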
