TL;DR

Preference optimization techniques like DPO (Direct Preference Optimization) align AI models with human preferences by training directly on examples of "better" vs. "worse" responses. DPO achieves similar results to the more complex RLHF pipeline but is simpler to implement, more stable to train, and faster. It's the technique behind much of the alignment in the AI tools you use today.

Why it matters

A raw language model, freshly pre-trained on internet text, is like a brilliant but completely unsocialized person. It can write fluently about any topic, but it's equally happy to help you write a thank-you note or explain how to do something dangerous. It might give a factual but unhelpful response, or a confident but completely wrong one. It has no concept of what makes a response "good."

Preference optimization is how we fix this. By showing the model thousands of examples where human evaluators said "this response is better than that one," the model learns to produce responses that are helpful, honest, and safe. This is why ChatGPT, Claude, and Gemini feel useful rather than chaotic. Without preference optimization, these tools would be impressive but impractical.

Understanding preference optimization matters whether you're building AI products (you need to know how alignment works under the hood), evaluating AI tools (understanding why different tools behave differently), or working with fine-tuning (preference optimization is the final and most impactful training stage).

The problem with raw language models

Pre-trained language models learn by predicting the next word in text from the internet. This gives them enormous knowledge and fluency, but it doesn't teach them to be helpful.

Consider a simple question: "How do I remove a stripped screw?" A raw language model might respond with any of these:

  • A detailed, helpful step-by-step guide (from a DIY blog)
  • An incomplete one-sentence answer (from a forum comment)
  • A sales pitch for a screw extraction tool (from a product page)
  • An unrelated tangent about screw manufacturing (from a Wikipedia article)

All of these existed in its training data. The model has no built-in preference for which type of response is most useful. Preference optimization teaches it to favor the helpful, complete answer.

How RLHF works (and why it's complicated)

Before DPO, the standard approach was Reinforcement Learning from Human Feedback (RLHF). Here's the pipeline:

Step 1: Collect preference data. Show human raters two responses to the same prompt and ask "which is better?" Repeat thousands of times to build a dataset of human preferences.

Step 2: Train a reward model. Build a separate AI model that predicts how much a human would like any given response. This model learns from the preference data to assign a "quality score" to any response.

Step 3: Optimize with reinforcement learning. Use the reward model as a scoring function and train the language model to generate responses that score highly. This uses PPO (Proximal Policy Optimization), a complex reinforcement learning algorithm.
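The reward-model step (Step 2) typically uses a Bradley-Terry pairwise loss: given the model's scalar scores for the chosen and rejected responses, the loss is small when the chosen response scores higher and large when it scores lower. A minimal sketch in plain Python (the function names are illustrative; a real implementation computes this over batches of neural-network outputs):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log P(chosen beats rejected).

    Minimizing this pushes the reward model to score the
    human-preferred response above the rejected one.
    """
    return -math.log(sigmoid(score_chosen - score_rejected))
```

When the scores are tied, the loss is log(2) (the model is 50/50 on which response wins); it shrinks toward zero as the chosen response's score pulls ahead.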

The problems with this pipeline:

  • Two models to train: You need a separate reward model, which is complex and expensive.
  • Unstable RL training: PPO is notoriously finicky. Small hyperparameter changes can cause training to collapse.
  • Reward hacking: The language model can learn to exploit quirks in the reward model rather than actually being helpful. It might find responses that score highly but aren't genuinely good.
  • Engineering complexity: Managing two models, PPO training loops, and reward model updates requires specialized infrastructure.

DPO: the simpler alternative

DPO, published in 2023, solved these problems with a key insight: you don't actually need a separate reward model. You can train the language model directly on preference data.

How DPO works:

  1. Collect preference data. Same as RLHF: pairs of (prompt, chosen response, rejected response) where humans picked the chosen response as better.

  2. Train directly. DPO's key derivation shows that a separate reward model is redundant: under a KL constraint to a reference model, the implied reward of a response is proportional to the log-probability ratio between the model being trained and that reference. So instead of training a reward model and then running RL, DPO adjusts the language model's probabilities directly: increase the likelihood of generating chosen responses, decrease the likelihood of generating rejected responses.

  3. That's it. No reward model, no RL, no PPO. A single training loop on the preference data.
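The loss behind those steps can be sketched from four numbers per preference pair: the log-probability of the chosen and rejected responses under the model being trained, and under a frozen reference model. This simplified scalar version is illustrative, not a production implementation (real implementations sum token-level log-probabilities over each response, and the beta value is a tunable hyperparameter):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on a single preference pair.

    beta controls how far the trained model may drift from the
    frozen reference model.
    """
    # Implicit rewards: scaled log-probability ratios vs. the reference
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Bradley-Terry style loss on the reward margin: minimizing it
    # raises the chosen response's probability relative to the rejected one
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note the role of the reference model: if the trained model hasn't moved from it at all, both implicit rewards are zero and the loss sits at log(2). The loss falls only when the model shifts probability toward chosen responses relative to the reference.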

The result? Models trained with DPO perform as well as RLHF-trained models (sometimes better) while being much simpler to implement, more stable during training, and cheaper to run.

How preference data is collected

The quality of preference optimization depends entirely on the quality of the preference data. Here's how companies collect it:

Human annotators are the gold standard. Trained raters compare two responses and select the better one. Top AI labs employ thousands of annotators with specific guidelines about what "better" means (more helpful, more accurate, safer, better formatted).

AI-assisted labeling uses an existing strong model to evaluate responses, which is faster and cheaper than human labeling. This is the "RLAIF" approach (Reinforcement Learning from AI Feedback) used in Constitutional AI.

User feedback from production systems provides real-world preference signals. When a user regenerates a response, gives a thumbs down, or edits an AI's output, that's implicit preference data.

Expert curation involves domain specialists creating preference pairs for specific skills (coding, medical advice, legal analysis). These high-quality pairs are expensive but highly effective for targeted improvements.

Most modern systems combine all four sources. Broad human annotation provides the foundation, AI labeling scales it up, user feedback adds real-world signal, and expert curation targets specific capabilities.
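Whatever the source, the data all lands in the same shape: a prompt plus a chosen and a rejected response. A single record might look like the following sketch; the field names and source tags are illustrative, not any specific dataset's schema:

```python
# One hypothetical preference-pair record, reusing the stripped-screw
# example from earlier. Field names are illustrative only.
preference_example = {
    "prompt": "How do I remove a stripped screw?",
    "chosen": (
        "Place a wide rubber band over the screw head, press the "
        "screwdriver firmly into it, and turn slowly for extra grip."
    ),
    "rejected": "Screws are manufactured from steel wire by cold heading.",
    # Provenance tag, e.g. "human_annotation", "ai_labeling",
    # "user_feedback", or "expert_curation"
    "source": "human_annotation",
}
```

Tracking the source of each pair matters in practice: it lets teams weight trusted human or expert labels more heavily and audit how much of the dataset came from cheaper AI-generated labels.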

The helpfulness-safety tradeoff

Preference optimization reveals a fundamental tension in AI alignment: helpfulness and safety often pull in opposite directions.

A maximally helpful model would answer any question fully and completely. A maximally safe model would refuse any request that could possibly be misused. In practice, you need a balance, and where you draw the line is a values decision.

Too far toward safety: The model refuses legitimate requests, adds excessive caveats, or gives watered-down answers. Users get frustrated and find less safe alternatives.

Too far toward helpfulness: The model assists with harmful requests, generates unsafe content, or doesn't flag risks when it should.

Preference data encodes this balance. If annotators are instructed to prefer responses that always add safety disclaimers, the model learns to be cautious. If they prefer responses that directly answer questions, the model learns to be forthcoming. The annotator guidelines are, in effect, making a societal decision about how AI should behave.

Different companies make different choices here, which is why Claude, GPT, and Gemini have noticeably different "personalities" when handling sensitive topics. These aren't just engineering differences; they're different values expressed through preference data.

DPO variants and evolution

Since DPO's publication, researchers have developed several improvements:

IPO (Identity Preference Optimization): Addresses a theoretical issue in DPO where the model can overfit to the preference data. IPO adds regularization to produce more stable results.

KTO (Kahneman-Tversky Optimization): Named after the behavioral economists Daniel Kahneman and Amos Tversky, whose prospect theory inspired its loss function. KTO works with individual responses rated as good or bad, rather than requiring paired comparisons. This data is easier to collect, since labeling single responses is simpler than comparing pairs.

ORPO (Odds Ratio Preference Optimization): Combines the initial supervised fine-tuning step with preference optimization into a single training stage, further simplifying the pipeline.

SimPO and other recent methods: An active research area in 2025-2026, with new techniques appearing regularly that improve efficiency, stability, or alignment quality.

The trend is clear: each new method is simpler and more efficient than the last, making strong alignment accessible to more teams with fewer resources.

Common mistakes

Using low-quality preference data. Garbage in, garbage out. If your annotators are inconsistent, rushing, or following unclear guidelines, the model learns contradictory preferences. Invest heavily in annotator training and guideline quality.

Not having enough diverse prompts. If your preference data only covers common questions, the model won't learn how to handle unusual or adversarial inputs. Include edge cases, challenging prompts, and safety-relevant scenarios in your preference dataset.

Ignoring the reference model. DPO works by measuring how much the model's response probabilities change relative to a reference model (typically the supervised fine-tuned model before DPO). If the reference model is poor, DPO's training signal is noisy. Make sure your base model is well-trained before applying DPO.

Over-optimizing for one metric. If you only optimize for helpfulness ratings, safety suffers. If you only optimize for safety, helpfulness suffers. Build preference data that covers multiple dimensions and evaluate across all of them.

Treating preference optimization as a one-time step. User needs and safety concerns evolve. Production systems need ongoing preference data collection and periodic retraining to stay aligned with current expectations.

What's next?

Preference optimization connects to several important topics: