Preference Optimization: DPO and Beyond
Direct Preference Optimization (DPO) and its variants train models on human preference data without a separate reward model, making alignment simpler and more stable than RLHF.
TL;DR
DPO trains models directly on preference data (response A preferred over response B), without the separate reward model that RLHF requires. The result is simpler, faster, more stable alignment.
RLHF challenges
- Two-stage pipeline: train a reward model, then optimize the policy against it with RL
- RL training is unstable and sensitive to hyperparameters
- An inaccurate reward model can be exploited by the policy (reward hacking)
- Complex to implement and debug
DPO approach
Single-stage optimization directly on preference data:
- Collect preference pairs: a preferred and a rejected response to the same prompt
- Train the model to increase the likelihood of preferred responses relative to rejected ones, while staying close to a reference model
- No reward model or RL loop needed
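The steps above reduce to a single loss on each preference pair. A minimal sketch, assuming the caller has already summed token log-probabilities for each response under both the policy being trained and a frozen reference model (the function name and argument names are illustrative, not a library API):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Scalar DPO loss for one (chosen, rejected) preference pair."""
    # Implicit rewards: how much the policy up-weights each response
    # relative to the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the chosen response clearly
    # beats the rejected one, large when the ranking is inverted.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

`beta` controls how far the policy may drift from the reference model: larger values penalize drift more. Note that the loss drops when the policy ranks the chosen response above the rejected one, with no reward model in sight.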
Variants
IPO (Identity Preference Optimization): Replaces DPO's log-sigmoid with a squared loss, making training more stable and less prone to overfitting preference labels
KTO (Kahneman-Tversky Optimization): Works from single responses labeled desirable or undesirable, so it does not require paired comparisons
ORPO (Odds Ratio Preference Optimization): Combines SFT and preference learning in one stage, removing the reference model entirely
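To make the DPO/IPO contrast concrete, here is a hedged sketch of the two per-pair objectives. `margin` stands for the difference in policy-versus-reference log-ratios between the chosen and rejected responses; `beta` is the regularization strength (the IPO paper writes this constant as tau), and the function names are illustrative:

```python
import math

def dpo_pair_loss(margin, beta=0.1):
    # Logistic loss on the reward margin: keeps pushing even on
    # confidently ranked pairs, which can overfit noisy labels.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def ipo_pair_loss(margin, beta=0.1):
    # Squared loss pulling the margin toward 1/(2*beta): once a pair
    # is separated by that amount, the gradient is zero instead of
    # continuing to grow the margin.
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

The design difference is the stability claim in a nutshell: DPO's logistic loss never saturates, while IPO's squared loss has a fixed target margin.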
Implementation
Requires a preference dataset of (prompt, chosen, rejected) triplets. Simpler to implement and easier to tune than RLHF.
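A sketch of that data layout. The field names mirror common open preference datasets, but the example records and the helper function are illustrative assumptions, not a fixed schema:

```python
# Each record pairs one prompt with a preferred and a rejected completion.
preference_data = [
    {
        "prompt": "Summarize photosynthesis in one sentence.",
        "chosen": "Plants use sunlight, water, and CO2 to make sugar and oxygen.",
        "rejected": "Photosynthesis is when plants eat dirt at night.",
    },
]

def validate(records):
    """Check that every record contains the (prompt, chosen, rejected) triplet."""
    required = {"prompt", "chosen", "rejected"}
    return all(required <= record.keys() for record in records)
```

Because the labels live directly in the triplets, a DPO run is just supervised training over this table; there is no separate environment or rollout step to configure.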
Results
Alignment quality comparable to or better than RLHF, with faster training and fewer hyperparameters to tune.
Key Terms Used in This Guide
RLHF (Reinforcement Learning from Human Feedback)
A training method where humans rate AI outputs to teach the model which responses are helpful, harmless, and accurate.
Model
The trained AI system that contains all the patterns it learned from data. Think of it as the 'brain' that makes predictions or decisions.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence, like understanding language, recognizing patterns, or making decisions.
Related Guides
Training Efficient Models: Doing More with Less
Advanced. Learn techniques for training AI models efficiently. From data efficiency to compute optimization: practical approaches for reducing training costs and time.
AI Training Data Basics: What AI Learns From
Beginner. Understand how training data shapes AI behavior. From data collection to quality: what you need to know about the foundation of all AI systems.
Data Labeling Fundamentals: Creating Quality Training Data
Intermediate. Learn the essentials of data labeling for AI. From annotation strategies to quality control: practical guidance for creating the labeled data that AI needs to learn.