TL;DR

RLHF trains AI by learning from human preferences rather than explicit rules. Humans compare AI outputs and indicate which is better; the model learns to produce the kinds of outputs humans prefer. It is a core part of how ChatGPT, Claude, and other modern assistants are made helpful and safe.

Why it matters

RLHF transformed AI assistants from "predict the next word" engines into systems that try to be genuinely helpful. It is one of the key techniques that made them useful for everyday tasks. Understanding RLHF helps you understand both the capabilities and the limitations of modern AI.

How RLHF works

The three stages

Stage 1: Pre-training

  • Train base model on large text corpus
  • Learns language patterns and knowledge
  • Good at completing text, not conversations

Stage 2: Supervised fine-tuning

  • Human demonstrators write ideal responses
  • Model learns from these examples
  • Better at conversation format

Stage 3: RLHF

  • Humans compare model outputs
  • Train reward model on preferences
  • Optimize AI to maximize reward

Step by step

1. Generate multiple responses to a prompt
2. A human ranks the responses (A > B > C)
3. Train a reward model on the rankings
4. Use reinforcement learning to push the model toward higher-reward responses
5. Repeat with more comparisons
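
A minimal sketch of this loop in Python, with a toy "policy" that is just a softmax over three canned responses and a stand-in reward function; every name and number here is illustrative, not a real training setup:

  import math, random

  # Toy setup: the "policy" is a log-preference weight per candidate response.
  candidates = [
      "Sure, here is a step-by-step answer...",
      "I don't know.",
      "That question is impossible to answer.",
  ]
  weights = [0.0, 0.0, 0.0]  # start uniform

  def reward_model(response):
      # Stand-in for a learned reward model: prefers the helpful answer.
      return 1.0 if "step-by-step" in response else -1.0

  def sample(weights):
      probs = [math.exp(w) for w in weights]
      total = sum(probs)
      probs = [p / total for p in probs]
      idx = random.choices(range(len(candidates)), probs)[0]
      return idx, probs

  learning_rate = 0.1
  for step in range(200):
      idx, probs = sample(weights)          # generate a response
      r = reward_model(candidates[idx])     # reward model stands in for human judgment
      # policy-gradient update: shift probability toward high-reward responses
      for i in range(len(weights)):
          grad = (1.0 if i == idx else 0.0) - probs[i]
          weights[i] += learning_rate * r * grad

  print(max(zip(weights, candidates))[1])   # the policy now favors the helpful answer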

Key components

The reward model

Predicts human preferences:

Training:

  • Input: Prompt + response
  • Output: Scalar "quality" score
  • Learns from human comparisons

Purpose:

  • Evaluates any response
  • Provides training signal
  • Scales human feedback
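
As a sketch, the standard training objective is a pairwise ranking loss: the reward model should score the human-preferred response above the rejected one. A minimal PyTorch version is below, assuming each prompt-plus-response pair has already been reduced to a feature vector; a real reward model puts this scalar head on top of a full transformer:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  feature_dim = 16
  reward_head = nn.Linear(feature_dim, 1)   # maps features to a scalar "quality" score
  optimizer = torch.optim.Adam(reward_head.parameters(), lr=1e-3)

  # One batch of human comparisons: features of the chosen and the rejected response.
  chosen_feats = torch.randn(8, feature_dim)
  rejected_feats = torch.randn(8, feature_dim)

  for _ in range(100):
      r_chosen = reward_head(chosen_feats).squeeze(-1)
      r_rejected = reward_head(rejected_feats).squeeze(-1)
      # Bradley-Terry style loss: maximize the probability that chosen outranks rejected.
      loss = -F.logsigmoid(r_chosen - r_rejected).mean()
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()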

Reinforcement learning

Optimizes the AI model:

Process:

  • AI generates response
  • Reward model scores it
  • AI updated to get higher scores
  • With constraints to prevent gaming

Key technique: PPO
Proximal Policy Optimization limits how much the model can change in a single update, and a KL penalty toward the original model keeps its behavior coherent.
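
A hedged sketch of the two ingredients usually combined here: the PPO clipped objective and a KL penalty toward a frozen reference copy of the model. The tensors below are dummy per-token values standing in for real model outputs, not a full training loop:

  import torch

  # Dummy per-token log-probabilities for a batch of 4 responses, 10 tokens each.
  logp_new = torch.randn(4, 10, requires_grad=True)          # current policy
  logp_old = logp_new.detach() + 0.05 * torch.randn(4, 10)   # policy that sampled the responses
  logp_ref = logp_new.detach() + 0.10 * torch.randn(4, 10)   # frozen reference model
  advantages = torch.randn(4, 10)                            # derived from reward-model scores
  clip_eps, kl_coef = 0.2, 0.1

  # PPO clipped surrogate: limits how far each update moves the policy.
  ratio = torch.exp(logp_new - logp_old)
  unclipped = ratio * advantages
  clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
  ppo_loss = -torch.min(unclipped, clipped).mean()

  # KL penalty: discourages drifting far from the reference model's behavior.
  kl_penalty = kl_coef * (logp_new - logp_ref).mean()

  loss = ppo_loss + kl_penalty
  loss.backward()   # in a real setup, an optimizer step would follow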

Human feedback collection

Humans provide the ground truth:

Comparison format:

  • "Which response is better?"
  • "Rate these responses"
  • "Is this response problematic?"

Quality matters:

  • Careful guidelines
  • Trained raters
  • Quality checks
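
As a sketch, one collected comparison might be stored as a record like the one below; the field names are illustrative, not any particular vendor's schema:

  comparison = {
      "prompt": "Explain photosynthesis to a 10-year-old.",
      "responses": {
          "A": "Plants use sunlight to turn water and air into food...",
          "B": "Photosynthesis is the process by which chlorophyll-containing organisms...",
      },
      "ranking": ["A", "B"],            # the rater judged A better than B
      "flags": {"problematic": False},  # e.g. unsafe or policy-violating content
      "rater_id": "rater_0042",
      "guideline_version": "v3.1",      # which rater guidelines were in force
  }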

What RLHF teaches

Helpfulness

  • Actually answer questions
  • Follow instructions
  • Provide useful information

Harmlessness

  • Refuse dangerous requests
  • Avoid toxic content
  • Respect privacy

Honesty

  • Admit uncertainty
  • Correct mistakes
  • Avoid making up facts

Limitations of RLHF

Reward hacking

The model finds ways to score highly without genuinely better quality:

  • Longer responses (often rated higher)
  • Hedging and caveats
  • Sycophantic agreement
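
One simple monitoring check, sketched with NumPy below: if reward-model scores correlate strongly with response length, length bias (a common form of reward hacking) may be creeping in. The numbers and the threshold are illustrative, not standard values:

  import numpy as np

  # Reward-model scores and token lengths for a sample of generated responses.
  rewards = np.array([0.2, 0.9, 1.4, 0.5, 1.8, 2.1, 0.7, 1.1])
  lengths = np.array([ 40, 210, 380, 120, 450, 600, 150, 260])

  corr = np.corrcoef(rewards, lengths)[0, 1]   # Pearson correlation
  print(f"reward-length correlation: {corr:.2f}")

  if corr > 0.8:   # threshold is a judgment call
      print("Warning: the reward may be favoring length rather than quality.")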

Human feedback problems

Human raters aren't perfect:

  • Inconsistent judgments
  • Biased preferences
  • Can't evaluate all topics

Scalability

Human feedback is expensive:

  • Limited by human time
  • Can't cover all topics
  • Ongoing cost

Goodhart's Law

When a measure becomes a target, it ceases to be a good measure:

  • Model optimizes for reward
  • May not match actual preference
  • Need diverse evaluation

Beyond basic RLHF

Constitutional AI

Add principle-based self-critique:

  • Define principles
  • AI critiques own outputs
  • Less reliant on human feedback
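
A minimal sketch of the critique-and-revise structure, with a placeholder generate() standing in for whatever model you would actually call; both the principles and the prompts are illustrative, not Anthropic's published implementation:

  PRINCIPLES = [
      "Choose the response that is most helpful while avoiding harmful advice.",
      "Choose the response that is honest about uncertainty.",
  ]

  def generate(prompt):
      # Placeholder for a real model call; returns canned text so the sketch runs.
      return f"[model output for: {prompt[:40]}...]"

  def constitutional_revision(user_prompt):
      draft = generate(user_prompt)
      for principle in PRINCIPLES:
          critique = generate(
              "Critique the following response against this principle.\n"
              f"Principle: {principle}\nResponse: {draft}"
          )
          draft = generate(
              "Rewrite the response to address the critique.\n"
              f"Critique: {critique}\nOriginal response: {draft}"
          )
      return draft   # revised drafts become training data for further fine-tuning

  print(constitutional_revision("How should I treat a minor burn at home?"))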

AI feedback

Use AI models to provide the feedback (often called RLAIF):

  • More scalable
  • Risk of amplifying errors
  • Careful validation needed

Direct preference optimization

Simpler alternative to RL:

  • Skip reward model
  • Train directly on preferences
  • Increasingly popular
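
A sketch of the DPO loss on dummy log-probabilities: it directly widens the gap between the chosen and the rejected response relative to a frozen reference model, with no separate reward model and no RL loop. The values below are placeholders standing in for summed per-response log-probabilities from real models:

  import torch
  import torch.nn.functional as F

  beta = 0.1   # controls how far the policy may drift from the reference model

  # Summed log-probabilities of whole responses (dummy values).
  policy_chosen   = torch.tensor([-12.0, -15.0], requires_grad=True)
  policy_rejected = torch.tensor([-14.0, -13.5], requires_grad=True)
  ref_chosen      = torch.tensor([-12.5, -15.5])
  ref_rejected    = torch.tensor([-13.8, -13.9])

  # DPO loss: push the chosen response's log-ratio above the rejected one's.
  chosen_logratio = policy_chosen - ref_chosen
  rejected_logratio = policy_rejected - ref_rejected
  loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
  loss.backward()   # gradients flow straight into the policy; no reward model needed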

Practical implications

For users

Understanding behavior:

  • AI is trained to be helpful
  • May be overly cautious
  • Preferences shaped by rater guidelines

Working effectively:

  • Clear instructions help
  • AI tries to satisfy preferences
  • Feedback shapes future training

For builders

Custom fine-tuning:

  • Can fine-tune on your preferences
  • Need quality feedback data
  • Consider alignment implications

Common mistakes

Mistake                       Problem                      Prevention
Assuming perfect alignment    RLHF isn't perfect           Verify behavior
Ignoring reward hacking       Model games the system       Monitor for patterns
Low-quality feedback          Garbage in, garbage out      Quality rater training
Over-optimization             Model becomes sycophantic    Regularization, diverse eval

What's next

Explore AI training further: