Preference Optimization: DPO and Beyond
Direct Preference Optimization (DPO) and its variants train models directly on human preferences without a separate reward model, making alignment simpler and more stable than RLHF.
TL;DR
DPO trains models directly on preference data (response A preferred over response B), skipping the separate reward model that RLHF requires. The result is simpler, faster, and more stable alignment.
RLHF challenges
- Two-stage pipeline: first train a reward model on human preferences, then optimize the policy against it
- RL training (typically PPO) is unstable and sensitive to hyperparameters
- The reward model can be inaccurate or exploited (reward hacking)
- Complex to implement and debug
DPO approach
Single-stage optimization directly on preferences:
- Collect preference pairs (a preferred and a rejected response to the same prompt)
- Train the model to increase the likelihood of preferred responses relative to rejected ones, measured against a frozen reference model (see the loss sketch below)
- No separate reward model needed
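A minimal sketch of the DPO objective under these assumptions, written in PyTorch. The function name, the dummy values, and the choice of beta = 0.1 are illustrative, not prescribed by this guide; the inputs are summed per-token log-probabilities for each response under the policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward for each response: beta * log(pi_policy / pi_ref),
    # computed from summed log-probabilities of shape (batch,).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the reward margin: push the chosen response to
    # out-score the rejected one under the implicit reward.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example call with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
```

Note that no reward model appears anywhere: the log-probability ratios against the reference model act as an implicit reward, which is what lets DPO collapse RLHF's two stages into one.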
Variants
IPO (Identity Preference Optimization): Replaces DPO's logistic loss with a squared loss, making training more stable and less prone to overfitting the preference data (compared in the sketch below)
KTO (Kahneman-Tversky Optimization): Learns from single responses labeled desirable or undesirable instead of paired preferences
ORPO (Odds Ratio Preference Optimization): Combines supervised fine-tuning and preference learning in a single stage, with no reference model
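A hedged sketch contrasting the DPO and IPO objectives on the same log-ratio margin. Treating beta as both DPO's temperature and IPO's regularization strength is a simplifying assumption for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_vs_ipo(margin, beta=0.1):
    # margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # summed log-probabilities, shape (batch,).
    dpo = -F.logsigmoid(beta * margin)        # logistic: keeps pushing once chosen wins
    ipo = (margin - 1.0 / (2.0 * beta)) ** 2  # squared: regresses the margin to a fixed target
    return dpo.mean(), ipo.mean()
```

The squared loss is what gives IPO its stability: instead of rewarding ever-larger margins, it pulls the margin toward a fixed target, which limits drift away from the reference model.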
Implementation
Training requires a preference dataset of (prompt, chosen, rejected) triplets, as in the sketch below. The loop is simpler than RLHF's and easier to tune; the main knob is beta, which controls how far the policy may drift from the reference model.
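A minimal sketch of the triplet format. The prompts and responses are made-up placeholders; the column names follow the common convention consumed by preference-tuning libraries such as Hugging Face TRL, though exact names can differ by framework.

```python
# Each entry pairs a preferred and a rejected completion of the same prompt.
preference_data = [
    {
        "prompt": "Explain photosynthesis to a 10-year-old.",
        "chosen": "Plants are like tiny kitchens that cook their own food using sunlight...",
        "rejected": "Photosynthesis proceeds via the C3 and C4 carbon fixation pathways...",
    },
    {
        "prompt": "Write a polite decline to a meeting invite.",
        "chosen": "Thank you for the invitation. Unfortunately I can't attend, but...",
        "rejected": "No.",
    },
]
```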
Results
Comparable to or better than RLHF in alignment quality, with faster training and fewer hyperparameters to tune.
Key Terms Used in This Guide
RLHF (Reinforcement Learning from Human Feedback)
A training method where humans rate AI outputs to teach the model which responses are helpful, harmless, and accurate.
Model
The trained AI system that contains all the patterns it learned from data. Think of it as the 'brain' that makes predictions or decisions.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence, like understanding language, recognizing patterns, or making decisions.
Related Guides
Constitutional AI: Teaching Models to Self-Critique
Advanced: Constitutional AI trains models to follow principles, self-critique, and revise harmful outputs without human feedback on every example.
AI Safety and Alignment: Building Helpful, Harmless AI
Intermediate: AI alignment ensures models do what we want them to do safely. Learn about RLHF, safety techniques, and responsible deployment.
Active Learning: Smart Data Labeling
Advanced: Reduce labeling costs by intelligently selecting which examples to label. Active learning strategies for efficient model training.