TL;DR

DPO (Direct Preference Optimization) trains a model directly on preference data (response A preferred over response B), skipping the separate reward model that RLHF requires. The result is simpler, faster, and more stable alignment.

RLHF challenges

  • Two-stage pipeline: train a reward model, then optimize the policy against it with RL
  • RL training (typically PPO) is unstable and sensitive to hyperparameters
  • The learned reward model can be inaccurate and is prone to being over-optimized (reward hacking)
  • Complex to implement and debug

DPO approach

Single-stage optimization directly on preference data:

  • Collect preference pairs (a preferred and a rejected response to the same prompt)
  • Train the model to increase the likelihood of preferred responses relative to rejected ones, while staying close to a frozen reference model
  • No reward model or RL loop needed; the reward is implicit in the loss (see the sketch below)
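
The core of the method is a single logistic loss over log-probability ratios against the frozen reference model. A minimal PyTorch sketch (the function name and the `beta` default are illustrative, not from the original note):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each tensor holds the summed token log-probabilities of the chosen or
    rejected response under the trainable policy or the frozen reference
    model. beta controls how closely the policy is kept to the reference.
    """
    # Implicit rewards: scaled log-ratios of policy to reference probabilities
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen reward above the rejected reward
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```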

Variants

  • IPO (Identity Preference Optimization): replaces DPO's logistic loss with a squared loss, which bounds the implicit reward and is more stable under noisy or deterministic preferences (see the sketch below)
  • KTO (Kahneman-Tversky Optimization): learns from unpaired single examples labeled desirable or undesirable, instead of preference pairs
  • ORPO (Odds Ratio Preference Optimization): combines SFT and preference learning in a single stage, with no reference model
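
For contrast with DPO, IPO regresses the log-ratio margin toward a fixed target instead of pushing it through a sigmoid. A hedged sketch that reuses the same log-probability inputs as the DPO example above (the 1/(2*beta) target follows the IPO formulation with beta as the regularization strength; the function name is illustrative):

```python
def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """IPO loss: squared error between the log-ratio margin and 1/(2*beta)."""
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    # Regressing the margin to a fixed target avoids the unbounded implicit
    # rewards that can make DPO overfit deterministic preference labels.
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()
```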

Implementation

Requires a preference dataset of (prompt, chosen, rejected) triplets plus a frozen copy of the starting model to serve as the reference. Training is simpler than RLHF and easier to tune; off-the-shelf implementations exist, e.g. TRL's DPOTrainer.
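
A preference example is just a prompt with one accepted and one rejected completion. A minimal sketch of the expected data layout (the example strings and the use of Hugging Face `datasets` are illustrative; trainers such as TRL's DPOTrainer consume data in this prompt/chosen/rejected format):

```python
from datasets import Dataset

# Each record pairs a prompt with a preferred ("chosen") and a
# dispreferred ("rejected") response collected from annotators.
preference_pairs = [
    {
        "prompt": "Explain what DPO is in one sentence.",
        "chosen": "DPO fine-tunes a model directly on preference pairs, "
                  "using an implicit reward instead of a separate reward model.",
        "rejected": "DPO is a kind of database optimization technique.",
    },
    # ... more (prompt, chosen, rejected) triplets
]

train_dataset = Dataset.from_list(preference_pairs)
print(train_dataset[0]["chosen"])
```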

Results

Reported results are comparable to or better than PPO-based RLHF, with faster training and fewer hyperparameters to tune.