Preference Optimization: DPO and Beyond
By Marcin Piekarski · builtweb.com.au · Last Updated: 11 February 2026
TL;DR
Preference optimization techniques like DPO (Direct Preference Optimization) align AI models with human preferences by training directly on examples of "better" vs. "worse" responses. DPO achieves similar results to the more complex RLHF pipeline but is simpler to implement, more stable to train, and faster. It's the technique behind much of the alignment in the AI tools you use today.
Why it matters
A raw language model, freshly pre-trained on internet text, is like a brilliant but completely unsocialized person. It can write fluently about any topic, but it's equally happy to help you write a thank-you note or explain how to do something dangerous. It might give a factual but unhelpful response, or a confident but completely wrong one. It has no concept of what makes a response "good."
Preference optimization is how we fix this. By showing the model thousands of examples where human evaluators said "this response is better than that one," the model learns to produce responses that are helpful, honest, and safe. This is why ChatGPT, Claude, and Gemini feel useful rather than chaotic. Without preference optimization, these tools would be impressive but impractical.
Understanding preference optimization matters whether you're building AI products (you need to know how alignment works under the hood), evaluating AI tools (understanding why different tools behave differently), or working with fine-tuning (preference optimization is the final and most impactful training stage).
The problem with raw language models
Pre-trained language models learn by predicting the next word in text from the internet. This gives them enormous knowledge and fluency, but it doesn't teach them to be helpful.
Consider a simple question: "How do I remove a stripped screw?" A raw language model might respond with any of these:
- A detailed, helpful step-by-step guide (from a DIY blog)
- An incomplete one-sentence answer (from a forum comment)
- A sales pitch for a screw extraction tool (from a product page)
- An unrelated tangent about screw manufacturing (from a Wikipedia article)
All of these existed in its training data. The model has no built-in preference for which type of response is most useful. Preference optimization teaches it to favor the helpful, complete answer.
How RLHF works (and why it's complicated)
Before DPO, the standard approach was Reinforcement Learning from Human Feedback (RLHF). Here's the pipeline:
Step 1: Collect preference data. Show human raters two responses to the same prompt and ask "which is better?" Repeat thousands of times to build a dataset of human preferences.
Step 2: Train a reward model. Build a separate AI model that predicts how much a human would like any given response. This model learns from the preference data to assign a "quality score" to any response.
Step 3: Optimize with reinforcement learning. Use the reward model as a scoring function and train the language model to generate responses that score highly. This uses PPO (Proximal Policy Optimization), a complex reinforcement learning algorithm.
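To make step 2 concrete, here's a minimal sketch of the pairwise objective reward models are typically trained with (the Bradley-Terry loss). The function and variable names are illustrative, not from any particular library:

```python
# Minimal sketch of reward-model training, assuming the model already
# produces one scalar score per response. Names here are illustrative.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: shrinks as the reward model
    scores the human-chosen response above the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example: scores for two preference pairs, shape (batch,)
chosen = torch.tensor([1.2, 0.8])
rejected = torch.tensor([0.3, 1.0])
loss = reward_model_loss(chosen, rejected)  # penalizes the mis-ordered second pair
```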
The problems with this pipeline:
- Two models to train: You need a separate reward model, which is complex and expensive.
- Unstable RL training: PPO is notoriously finicky. Small hyperparameter changes can cause training to collapse.
- Reward hacking: The language model can learn to exploit quirks in the reward model rather than actually being helpful. It might find responses that score highly but aren't genuinely good.
- Engineering complexity: Managing two models, PPO training loops, and reward model updates requires specialized infrastructure.
DPO: the simpler alternative
DPO, published in 2023, solved these problems with a key insight: you don't actually need a separate reward model. You can train the language model directly on preference data.
How DPO works:
Collect preference data. Same as RLHF: pairs of (prompt, chosen response, rejected response) where humans picked the chosen response as better.
Train directly. DPO uses a mathematical trick to fold the reward model into the language model training itself. Instead of training a separate reward model and then using RL, DPO adjusts the language model's probabilities directly: increase the likelihood of generating chosen responses and decrease the likelihood of generating rejected ones, measured relative to a frozen reference model (usually the supervised fine-tuned model you started from).
That's it. No reward model, no RL, no PPO. A single training loop on the preference data.
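For readers who want to see the trick itself, here is a minimal sketch of the DPO loss in PyTorch. It assumes you have already summed each response's token log-probabilities under both the model being trained and the frozen reference model; all names are illustrative:

```python
# Hedged sketch of the DPO objective (Rafailov et al., 2023); names illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is the summed log-probability of a response, shape (batch,).
    beta controls how far the policy may drift from the reference model."""
    # How much more (or less) likely each response became, relative to the reference
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Reward margins where the chosen response gains probability faster than
    # the rejected one loses it; this is the paper's "implicit reward"
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The frozen reference model is what keeps the policy anchored to its starting point; it reappears in the common mistakes below.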
The result? Models trained with DPO perform as well as RLHF-trained models (sometimes better) while being much simpler to implement, more stable during training, and cheaper to run.
How preference data is collected
The quality of preference optimization depends entirely on the quality of the preference data. Here's how companies collect it:
Human annotators are the gold standard. Trained raters compare two responses and select the better one. Top AI labs employ thousands of annotators with specific guidelines about what "better" means (more helpful, more accurate, safer, better formatted).
AI-assisted labeling uses an existing strong model to evaluate responses, which is faster and cheaper than human labeling. This is the "RLAIF" approach (Reinforcement Learning from AI Feedback) used in Constitutional AI.
User feedback from production systems provides real-world preference signals. When a user regenerates a response, gives a thumbs down, or edits an AI's output, that's implicit preference data.
Expert curation involves domain specialists creating preference pairs for specific skills (coding, medical advice, legal analysis). These high-quality pairs are expensive but highly effective for targeted improvements.
Most modern systems combine all four sources. Broad human annotation provides the foundation, AI labeling scales it up, user feedback adds real-world signal, and expert curation targets specific capabilities.
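Whatever the source, the records themselves are simple. Here's a sketch of one preference pair in the three-field format most DPO tooling (including Hugging Face TRL) expects; exact field names vary by library:

```python
# One preference record; the prompt reuses the stripped-screw example above
preference_pair = {
    "prompt": "How do I remove a stripped screw?",
    "chosen": (
        "Press a wide rubber band into the screw head for extra grip, then "
        "turn slowly with a well-fitting driver. If that fails, use a screw "
        "extractor bit."
    ),
    "rejected": "Just drill it out.",
}
```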
The helpfulness-safety tradeoff
Preference optimization reveals a fundamental tension in AI alignment: helpfulness and safety often pull in opposite directions.
A maximally helpful model would answer any question fully and completely. A maximally safe model would refuse any request that could possibly be misused. In practice, you need a balance, and where you draw the line is a values decision.
Too far toward safety: The model refuses legitimate requests, adds excessive caveats, or gives watered-down answers. Users get frustrated and find less safe alternatives.
Too far toward helpfulness: The model assists with harmful requests, generates unsafe content, or doesn't flag risks when it should.
Preference data encodes this balance. If annotators are instructed to prefer responses that always add safety disclaimers, the model learns to be cautious. If they prefer responses that directly answer questions, the model learns to be forthcoming. The annotator guidelines are, in effect, a decision about how AI should behave, made on behalf of everyone who uses it.
Different companies make different choices here, which is why Claude, GPT, and Gemini have noticeably different "personalities" when handling sensitive topics. These aren't just engineering differences; they're different values expressed through preference data.
DPO variants and evolution
Since DPO's publication, researchers have developed several improvements:
IPO (Identity Preference Optimization): Addresses a theoretical issue in DPO where the model can overfit to the preference data. IPO adds regularization to produce more stable results.
KTO (Kahneman-Tversky Optimization): Named after behavioral economists Daniel Kahneman and Amos Tversky, whose prospect theory inspired its loss. KTO works with individual responses rated as good or bad, rather than requiring paired comparisons. This makes data easier to collect, since labeling single responses is simpler than comparing pairs.
ORPO (Odds Ratio Preference Optimization): Combines the initial supervised fine-tuning step with preference optimization into a single training stage, further simplifying the pipeline.
SimPO and other recent methods: SimPO simplifies further by dropping the reference model entirely. This remains an active research area in 2025-2026, with new techniques appearing regularly that improve efficiency, stability, or alignment quality.
The trend is clear: each new method is simpler and more efficient than the last, making strong alignment accessible to more teams with fewer resources.
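The differences are easiest to see at the loss level. Below is a simplified sketch, where m stands for the scaled log-ratio margin from the DPO sketch earlier; these follow the IPO and KTO papers in spirit, not letter:

```python
# Simplified comparison of variant losses as a function of the preference
# margin m (how strongly the model favors the chosen response).
import torch
import torch.nn.functional as F

def dpo_from_margin(m: torch.Tensor) -> torch.Tensor:
    # DPO: logistic loss; the gradient never fully vanishes, so training can
    # keep pushing pairs apart, which is the overfitting risk IPO targets
    return -F.logsigmoid(m)

def ipo_from_margin(m: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # IPO (simplified): squared loss around a fixed target, so there is
    # nothing to gain from pushing the margin past that point
    return (m - 1.0 / (2.0 * tau)) ** 2

# KTO differs structurally rather than in the margin: it scores single
# responses labeled "good" or "bad" against a reference point, so it
# doesn't need paired comparisons at all.
```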
Common mistakes
Using low-quality preference data. Garbage in, garbage out. If your annotators are inconsistent, rushing, or following unclear guidelines, the model learns contradictory preferences. Invest heavily in annotator training and guideline quality.
Not having enough diverse prompts. If your preference data only covers common questions, the model won't learn how to handle unusual or adversarial inputs. Include edge cases, challenging prompts, and safety-relevant scenarios in your preference dataset.
Ignoring the reference model. DPO works by measuring how much the model's response probabilities change relative to a reference model (typically the supervised fine-tuned model before DPO). If the reference model is poor, DPO's training signal is noisy. Make sure your base model is well-trained before applying DPO.
Over-optimizing for one metric. If you only optimize for helpfulness ratings, safety suffers. If you only optimize for safety, helpfulness suffers. Build preference data that covers multiple dimensions and evaluate across all of them.
Treating preference optimization as a one-time step. User needs and safety concerns evolve. Production systems need ongoing preference data collection and periodic retraining to stay aligned with current expectations.
What's next?
Preference optimization connects to several important topics:
- AI Alignment Fundamentals — The broader problem that preference optimization addresses
- Constitutional AI — An alternative alignment approach that can complement preference optimization
- Fine-Tuning Fundamentals — The training stage that comes before preference optimization
Frequently Asked Questions
Is DPO replacing RLHF entirely?
Largely, yes, especially for smaller teams and open-source models. DPO and its variants are now the default choice for most preference optimization work because they're simpler and more stable. However, some top AI labs still use RLHF (or hybrid approaches) for their largest models, where the extra complexity of RL can provide marginal improvements at scale.
How much preference data do I need for DPO?
For fine-tuning an existing model on a specific domain, as few as 1,000-5,000 high-quality preference pairs can make a noticeable difference. For general-purpose alignment of a large model, top labs use hundreds of thousands to millions of preference pairs. Quality matters more than quantity: 5,000 carefully curated pairs often outperform 50,000 noisy ones.
Can I use DPO to fine-tune open-source models?
Yes, and this is one of DPO's biggest advantages. Libraries like Hugging Face TRL (Transformer Reinforcement Learning) make it straightforward to apply DPO to any language model. Many of the top open-source models (Llama, Mistral, and derivatives) use DPO as their alignment method.
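As a rough sketch of what that looks like (TRL's API has changed across versions, so treat argument names as approximate and check the docs for your version; model and file names below are placeholders):

```python
# Hedged sketch: DPO fine-tuning with Hugging Face TRL.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-org/your-sft-model"  # placeholder: start from an SFT model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# JSONL file with "prompt", "chosen", "rejected" fields (see the format above)
dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

config = DPOConfig(output_dir="dpo-output", beta=0.1)  # beta: KL penalty strength
trainer = DPOTrainer(model=model, args=config,
                     train_dataset=dataset, processing_class=tokenizer)
trainer.train()  # TRL creates the frozen reference model for you by default
```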
Why do different AI tools feel so different if they all use similar preference optimization?
Because the preference data is different. The annotator guidelines, the types of prompts included, the balance between helpfulness and safety, and the specific values encoded in the training data all vary between companies. The algorithm is similar, but the training data reflects different organizational values and priorities.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides.
Transparency Note: All AI-assisted content is thoroughly reviewed, fact-checked, and refined by Marcin Piekarski before publication.
Key Terms Used in This Guide
RLHF (Reinforcement Learning from Human Feedback)
A training method where humans rate AI outputs to teach the model which responses are helpful, harmless, and accurate.
Model
The trained AI system that contains all the patterns and knowledge learned from data. It's the end product of training—the 'brain' that takes inputs and produces predictions, decisions, or generated content.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Related Guides
- Training Efficient Models: Doing More with Less (Advanced · 10 min read) — Learn techniques for training AI models efficiently. From data efficiency to compute optimization—practical approaches for reducing training costs and time.
- AI Training Data Basics: What AI Learns From (Beginner · 9 min read) — Understand how training data shapes AI behavior. From data collection to quality—what you need to know about the foundation of all AI systems.
- Data Labeling Fundamentals: Creating Quality Training Data (Intermediate · 10 min read) — Learn the essentials of data labeling for AI. From annotation strategies to quality control—practical guidance for creating the labeled data that AI needs to learn.