Constitutional AI: Teaching Models to Self-Critique
Constitutional AI trains models to follow principles, self-critique, and revise harmful outputs without human feedback on every example.
TL;DR
Constitutional AI gives a model an explicit set of principles (a "constitution"), then trains it to critique its own outputs against those principles and revise them to comply. This reduces reliance on human labeling.
How it works
Phase 1: Supervised learning
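- The model generates responses to prompts
- It critiques its own responses against the constitution
- It revises each response to address the critique
- The model is then fine-tuned on the revised responses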
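The Phase 1 loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the guide's or any library's implementation: `generate` stands in for whatever model call you use, and the prompt wording, `critique_and_revise`, `build_sft_example`, and `toy_model` are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: any function that maps a prompt string to a completion string.
Generate = Callable[[str], str]


def critique_and_revise(
    prompt: str, principles: List[str], generate: Generate
) -> Tuple[str, List[Dict[str, str]]]:
    """Run one critique/revision pass per principle.

    Returns the final revised response and a trace of every step.
    """
    response = generate(prompt)  # initial draft
    trace: List[Dict[str, str]] = []
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        # Ask the model to rewrite the response using that critique.
        revision = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
        trace.append(
            {"principle": principle, "critique": critique, "revision": revision}
        )
        response = revision  # the next pass starts from the revised text
    return response, trace


def build_sft_example(prompt: str, final_response: str) -> Dict[str, str]:
    """Pair the original prompt with the revised response for fine-tuning."""
    return {"prompt": prompt, "completion": final_response}


def toy_model(text: str) -> str:
    """Stand-in for a real model so the sketch runs without any dependencies."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "critique text"
    return "draft answer"


if __name__ == "__main__":
    final, steps = critique_and_revise(
        "Tell me about my neighbor.", ["Respect user privacy"], toy_model
    )
    print(build_sft_example("Tell me about my neighbor.", final))
```

Only the prompt and the final revision go into the fine-tuning example; the intermediate critiques are kept in the trace for inspection.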
Phase 2: RL from AI Feedback (RLAIF)
- Train a reward model on preference labels produced by an AI model rather than by human annotators
- Fine-tune the model with RL to maximize that reward
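A rough sketch of the AI-feedback step in Python: an AI judge picks the better of two responses, and the resulting pairs train a reward model. Everything named here is an assumption for illustration: `judge` is a hypothetical callable returning "A" or "B", `toy_judge` is a trivial stand-in, and the pairwise loss shown is a common Bradley-Terry style choice for reward models rather than something this guide specifies.

```python
import math
from typing import Callable, Dict

# Hypothetical judge signature: (prompt, response_a, response_b, principle) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def label_pair(
    prompt: str, response_a: str, response_b: str, principle: str, judge: Judge
) -> Dict[str, str]:
    """Turn the judge's choice into a chosen/rejected preference record."""
    choice = judge(prompt, response_a, response_b, principle)
    if choice not in ("A", "B"):
        raise ValueError(f"judge must return 'A' or 'B', got {choice!r}")
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) in a stable way.

    The loss is small when the reward model scores the chosen response higher.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite the expression to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


def toy_judge(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Stand-in for an AI judge: rejects a response containing the word 'stupid'."""
    return "B" if "stupid" in response_a.lower() else "A"


if __name__ == "__main__":
    pair = label_pair(
        "How do I reset my password?",
        "That is a stupid question.",
        "Open Settings and choose Reset password.",
        "Avoid toxic, biased, or violent content",
        toy_judge,
    )
    print(pair["chosen"])
    print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

The RL step itself is omitted: it would use the trained reward model's scores as the reward signal when fine-tuning the policy.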
Constitution example principles
- "Be helpful, harmless, and honest"
- "Avoid toxic, biased, or violent content"
- "Respect user privacy"
- "Provide balanced perspectives"
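In code, a constitution can be as simple as a list of strings that gets inserted into critique prompts. The sketch below assumes one principle is sampled at random per critique pass, which is one possible design rather than a requirement; `CONSTITUTION`, `sample_principle`, and `critique_prompt` are hypothetical names, and the prompt wording is illustrative.

```python
import random
from typing import List, Optional

# The example principles from this guide, stored as plain data.
CONSTITUTION: List[str] = [
    "Be helpful, harmless, and honest",
    "Avoid toxic, biased, or violent content",
    "Respect user privacy",
    "Provide balanced perspectives",
]


def sample_principle(
    constitution: List[str], rng: Optional[random.Random] = None
) -> str:
    """Pick one principle at random; pass a seeded rng for reproducibility."""
    if not constitution:
        raise ValueError("constitution must contain at least one principle")
    rng = rng or random.Random()
    return rng.choice(constitution)


def critique_prompt(response: str, principle: str) -> str:
    """Build the text that asks a model to critique a response against one principle."""
    return (
        f"Response: {response}\n"
        f'Principle: "{principle}"\n'
        "Identify specific ways the response violates this principle."
    )


if __name__ == "__main__":
    principle = sample_principle(CONSTITUTION, random.Random(0))
    print(critique_prompt("Here is the draft answer.", principle))
```

Because the principles are plain data, changing the model's target behavior means editing this list rather than relabeling examples by hand.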
Benefits
- Scalable (less human labor)
- Transparent (explicit principles)
- Customizable (change constitution)
Limitations
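- Results depend on the quality of the constitution
- The model must be capable enough to understand and apply the principles
- Adherence is not perfect: a trained model can still violate its principles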
Key Terms Used in This Guide
Constitutional AI
A safety technique where an AI is trained using a set of principles (a 'constitution') to critique and revise its own outputs, making them more helpful, honest, and harmless without human feedback on every response.
Model
The trained AI system that contains all the patterns it learned from data. Think of it as the 'brain' that makes predictions or decisions.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence, like understanding language, recognizing patterns, or making decisions.
Related Guides
Preference Optimization: DPO and Beyond
Advanced: Direct Preference Optimization (DPO) and variants train models on human preferences without separate reward models. Simpler, more stable than RLHF.
AI Safety and Alignment: Building Helpful, Harmless AI
Intermediate: AI alignment ensures models do what we want them to do safely. Learn about RLHF, safety techniques, and responsible deployment.
Active Learning: Smart Data Labeling
Advanced: Reduce labeling costs by intelligently selecting which examples to label. Active learning strategies for efficient model training.