Constitutional AI: Teaching Models to Self-Critique
Constitutional AI trains models to follow principles, self-critique, and revise harmful outputs without human feedback on every example.
TL;DR
Constitutional AI gives models an explicit set of principles (a "constitution"), trains them to critique their own outputs against those principles, and to revise the outputs to comply. This reduces reliance on human labeling of individual examples.
How it works
Phase 1: Supervised learning
- The model generates responses to prompts
- It critiques its own responses against the constitution
- It revises the responses to better follow the principles (see the sketch after this list)
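A minimal sketch of the Phase 1 critique-and-revise loop, under stated assumptions: `generate` is a hypothetical stand-in for whatever model call you use (an API client, a local model, etc.), and the prompt templates and function names are illustrative, not the original paper's wording.

```python
def generate(prompt: str) -> str:
    """Placeholder: call the base model and return its text completion."""
    raise NotImplementedError


def critique_and_revise(user_prompt: str, principle: str) -> dict:
    # 1. The model answers the prompt as it normally would.
    response = generate(user_prompt)

    # 2. The model critiques its own answer against one constitutional principle.
    critique = generate(
        "Critique the response below against this principle.\n"
        f"Principle: {principle}\n"
        f"Prompt: {user_prompt}\n"
        f"Response: {response}\n"
        "Point out any way the response conflicts with the principle."
    )

    # 3. The model rewrites its answer, guided by its own critique.
    revision = generate(
        "Rewrite the response so it follows the principle, using the critique.\n"
        f"Principle: {principle}\n"
        f"Critique: {critique}\n"
        f"Original response: {response}"
    )

    # The (prompt, revision) pair becomes supervised fine-tuning data.
    return {"prompt": user_prompt, "response": revision}
```

The revised responses, not the originals, are what the model is fine-tuned on in this phase.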
Phase 2: RL from AI Feedback (RLAIF)
- Train a reward model on AI-generated preference labels instead of human labels
- Fine-tune the model with RL to maximize that reward (see the sketch after this list)
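Phase 2 swaps RLHF's human preference labeler for an AI one. The sketch below, which reuses the hypothetical `generate` call from the previous sketch, shows one way a feedback model could produce preference pairs; the comparison prompt and output format are illustrative assumptions.

```python
def ai_preference_label(user_prompt: str, resp_a: str, resp_b: str,
                        principle: str) -> dict:
    # Ask the feedback model which of two responses better follows a
    # constitutional principle (reuses the hypothetical `generate` above).
    verdict = generate(
        f"Principle: {principle}\n"
        f"Prompt: {user_prompt}\n"
        f"Response A: {resp_a}\n"
        f"Response B: {resp_b}\n"
        "Which response better follows the principle? Answer with A or B."
    )
    if verdict.strip().upper().startswith("A"):
        chosen, rejected = resp_a, resp_b
    else:
        chosen, rejected = resp_b, resp_a

    # Same shape as a human-labeled RLHF preference pair, so the downstream
    # reward-model training and RL fine-tuning pipeline is unchanged.
    return {"prompt": user_prompt, "chosen": chosen, "rejected": rejected}
```

From here, training proceeds as in standard RLHF: fit a reward model on the chosen/rejected pairs and fine-tune the policy with RL against it. Only the source of the labels changes.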
Constitution example principles (see the sketch after this list)
- "Be helpful, harmless, and honest"
- "Avoid toxic, biased, or violent content"
- "Respect user privacy"
- "Provide balanced perspectives"
Benefits
- Scalable (less human labor)
- Transparent (explicit principles)
- Customizable (change constitution)
Limitations
- Results depend on the quality of the constitution
- The model must be able to understand and apply the principles
- Adherence to the principles is not perfect
Key Terms Used in This Guide
Constitutional AI
A safety technique where an AI is trained using a set of principles (a 'constitution') to critique and revise its own outputs, making them more helpful, honest, and harmless without human feedback on every response.
Model
The trained AI system that contains all the patterns it learned from data. Think of it as the 'brain' that makes predictions or decisions.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence, like understanding language, recognizing patterns, or making decisions.
Related Guides
AI Alignment Fundamentals: Making AI Follow Human Intent
Intermediate: Understand the challenge of AI alignment, from goal specification to value learning, and why ensuring AI does what we want is harder than it sounds.
RLHF Explained: Training AI from Human Feedback
Intermediate: Understand Reinforcement Learning from Human Feedback. How modern AI systems learn from human preferences to become more helpful, harmless, and honest.
Preference Optimization: DPO and Beyond
Advanced: Direct Preference Optimization (DPO) and variants train models on human preferences without a separate reward model. Simpler and more stable than RLHF.