Constitutional AI: Teaching Models to Self-Critique
By Marcin Piekarski · builtweb.com.au · Last Updated: 11 February 2026
TL;DR
Constitutional AI is a technique developed by Anthropic that gives AI models a set of written principles (a "constitution") and trains them to evaluate and improve their own responses against those principles. Instead of relying on humans to rate every single output, the AI learns to self-critique and self-correct, making alignment more scalable and transparent.
Why it matters
Every AI model needs guardrails. Without them, a language model will happily generate harmful content, give dangerous advice, or assist with illegal activities, because it's just predicting the next likely word. The question isn't whether to add guardrails; it's how.
The traditional approach, Reinforcement Learning from Human Feedback (RLHF), works by having thousands of human reviewers rate AI outputs as good or bad. This is effective, but it's expensive, slow, and opaque. You're essentially encoding values into a model through thumbs up/thumbs down ratings, without being explicit about what those values are.
Constitutional AI takes a different approach: write down the rules. Instead of implicitly learning values from human ratings, the model gets explicit principles it can reference, reason about, and apply. This matters because it makes AI safety more transparent (you can read the constitution), more scalable (the AI does much of its own evaluation), and more adjustable (change the rules by editing the document).
If you use Claude, you're using a model shaped by Constitutional AI. Understanding how it works helps you understand why Claude behaves the way it does, and why it sometimes refuses requests or adds caveats.
How Constitutional AI works
The process has two main phases. Think of it as first teaching a student to self-edit, then training them to internalize those editing skills.
Phase 1: Supervised self-critique
Start with a language model that's already been pre-trained on large amounts of text. Then run this cycle:
1. Generate. Ask the model to respond to a variety of prompts, including tricky ones that might produce harmful outputs.
2. Critique. Show the model its own response alongside a constitutional principle (for example, "Choose the response that is most supportive and encouraging of life, liberty, and personal security"). Ask it to evaluate whether its response follows this principle.
3. Revise. The model rewrites its response to better align with the principle.
This generates a dataset of (original response, revised response) pairs, and the revised versions become supervised fine-tuning data. The model learns to produce revised-quality responses directly, without needing the critique step at inference time.
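The Phase 1 cycle can be sketched as a short loop. This is a structural sketch, not Anthropic's actual pipeline: `ask_model` is a hypothetical stand-in for any chat-completion call, and the principles and prompts are illustrative.

```python
import random

# Hypothetical stand-in for a real chat-completion API call.
# Here it just returns canned text so the loop structure is runnable.
def ask_model(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}]"

# Illustrative principles; a real constitution has many more.
CONSTITUTION = [
    "Choose the response that is most supportive and encouraging of "
    "life, liberty, and personal security.",
    "Choose the response that is as harmless and ethical as possible.",
]

def critique_and_revise(user_prompt: str) -> tuple[str, str]:
    """One Phase 1 cycle: generate, critique against a sampled principle, revise."""
    original = ask_model(user_prompt)
    principle = random.choice(CONSTITUTION)  # one principle sampled per pass
    critique = ask_model(
        f"Principle: {principle}\nResponse: {original}\n"
        "Identify ways the response violates this principle."
    )
    revised = ask_model(
        f"Response: {original}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    return original, revised

# Collect (original, revised) pairs; the revised texts become fine-tuning data.
dataset = [critique_and_revise(p) for p in ["How do I pick a lock?", "Tell me a joke"]]
```

The important structural point is that critique and revision are just two more model calls, which is why the whole loop can run without a human in it.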
Phase 2: Reinforcement Learning from AI Feedback (RLAIF)
This is where Constitutional AI diverges most sharply from RLHF. Instead of humans rating which response is better, the AI itself evaluates responses against the constitution.
1. Generate pairs of responses to the same prompt.
2. Have the AI judge which response better follows the constitutional principles.
3. Use these AI-generated preferences to train a reward model.
4. Fine-tune the language model using the reward model, just like in RLHF.
The key insight: having the AI evaluate responses against written principles is more scalable than having humans rate every pair. You get thousands of preference judgments per hour instead of dozens.
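The preference-labeling step above can be sketched as follows. The judge here is a toy heuristic standing in for a second model call, and all names and data are illustrative, not Anthropic's implementation.

```python
import random

# Illustrative principles; a real constitution has many more.
CONSTITUTION = [
    "Choose the response that is as harmless and ethical as possible.",
    "Choose the more honest and accurate response.",
]

def ai_judge(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    # In practice this is another model call asking which response better
    # follows the principle; here a toy heuristic prefers the shorter
    # response so the pipeline runs end to end.
    return "A" if len(response_a) <= len(response_b) else "B"

def label_pair(prompt: str, response_a: str, response_b: str) -> dict:
    """One RLAIF step: sample a principle, have the AI pick the better response."""
    principle = random.choice(CONSTITUTION)
    winner = ai_judge(prompt, response_a, response_b, principle)
    chosen, rejected = (
        (response_a, response_b) if winner == "A" else (response_b, response_a)
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

preferences = [
    label_pair(
        "How do I stay safe online?",
        "Use strong, unique passwords.",
        "Here is a very long and possibly risky answer with many caveats omitted.",
    )
]
```

The resulting (prompt, chosen, rejected) records have the same shape as human preference data, which is why the downstream reward-model training is identical to RLHF.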
The "constitution" itself
A constitution is a set of clear, written principles that define how the AI should behave. These aren't vague aspirations. They're specific enough that the AI can use them to evaluate its own responses.
The constitution Anthropic published for Claude drew principles from multiple sources:
- The Universal Declaration of Human Rights
- Apple's terms of service (as an example of a practical content policy)
- Principles about being helpful, harmless, and honest
- Specific rules about avoiding harmful content, respecting privacy, and providing balanced information
A typical principle might read: "Choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior."
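In Anthropic's published materials, each principle is effectively a pair: a critique request and a matching revision request. A minimal sketch of that structure (field names illustrative):

```python
# One principle represented as a critique/revision pair, in the style of
# Anthropic's published constitution. The field names are illustrative.
principle = {
    "critique_request": (
        "Identify specific ways in which the assistant's last response is "
        "harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
    ),
    "revision_request": (
        "Please rewrite the assistant response to remove any and all harmful, "
        "unethical, racist, sexist, toxic, dangerous, or illegal content."
    ),
}
```

Pairing the two means the same principle drives both the critique step and the revision step of Phase 1.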
The beauty of this approach is that the principles are readable by humans. You can look at a constitution and understand what values the AI is trained to follow. This is a significant improvement over RLHF, where the values are embedded invisibly in a reward model.
How it differs from RLHF
| Aspect | RLHF | Constitutional AI |
|---|---|---|
| Who evaluates? | Human reviewers | AI using written principles |
| Values are... | Implicit in ratings | Explicit in constitution |
| Scalability | Limited by human reviewers | Scales with compute |
| Transparency | Hard to audit | Principles are readable |
| Cost | Expensive (human labor) | Cheaper per evaluation |
| Consistency | Varies between reviewers | More consistent |
In practice, modern AI systems often combine both approaches. You might use Constitutional AI as the primary alignment method and supplement it with human feedback for edge cases or to validate that the constitution is producing the desired behavior.
Practical implications
For developers building with AI
Understanding Constitutional AI helps you work with models like Claude more effectively. When Claude declines a request or adds safety caveats, it's not following hard-coded rules. It's applying internalized principles. This means you can often rephrase requests to clarify your legitimate intent, and the model will help.
It also means that different AI providers make different constitutional choices, which is why Claude, GPT, and Gemini behave differently in similar situations. They have different "constitutions" (whether explicit or implicit).
For organizations adopting AI
Constitutional AI suggests a framework for your own AI governance. Before deploying AI systems, write down your principles explicitly. What should the AI do? What should it refuse to do? Having explicit rules is better than hoping the model "does the right thing."
For the broader AI safety conversation
Constitutional AI demonstrates that AI alignment doesn't have to be a black box. We can make AI values transparent and adjustable. This is important for public trust, regulatory compliance, and the ongoing conversation about how AI should behave.
Common mistakes
Thinking the constitution guarantees perfect behavior. A constitution improves alignment, but it doesn't guarantee it. Models can still misinterpret principles, encounter situations the constitution didn't anticipate, or find edge cases where principles conflict. Constitutional AI makes behavior more predictable, not perfect.
Assuming all AI models use explicit constitutions. Many models use RLHF or similar implicit approaches. When you encounter unexpected AI behavior, it's worth asking whether the model's values were explicitly specified or learned implicitly from human ratings.
Confusing Constitutional AI with hard-coded rules. Constitutional AI doesn't add if/then rules to the model. It trains the model to internalize principles so they influence every response. This means the model can apply principles to novel situations it hasn't seen before, but also means it can occasionally misapply them.
Thinking you can just write a perfect constitution. Constitution design is genuinely hard. Principles can conflict with each other (helpfulness vs. safety), be too vague to apply consistently, or miss important edge cases. Good constitutions require iteration, testing, and ongoing refinement.
What's next?
Constitutional AI connects to several important alignment and safety topics:
- AI Alignment Fundamentals — The broader challenge that Constitutional AI addresses
- Preference Optimization — DPO and other methods for training models on human preferences
- RLHF Explained — The approach that Constitutional AI builds upon and improves
Frequently Asked Questions
Does Claude use Constitutional AI?
Yes. Claude is built by Anthropic, the company that developed Constitutional AI. Anthropic has published research on the constitutional principles used in Claude's training. While the exact current constitution isn't fully public, the approach is central to how Claude is aligned to be helpful, harmless, and honest.
Can I create my own constitution for a custom AI model?
The full Constitutional AI training process requires significant resources to implement from scratch. However, the concept of explicit principles applies at every scale. When building AI applications, you can use system prompts with explicit principles as a lightweight version of a constitution. For custom model training, Anthropic's research papers and published constitution provide templates for writing constitutional principles.
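A prompt-level "constitution" can be as simple as a numbered list of rules assembled into a system prompt. The principles below are invented for illustration; adapt them to your own application.

```python
# A lightweight, prompt-level constitution: explicit principles placed in a
# system prompt. The principle text here is illustrative, not Anthropic's.
PRINCIPLES = [
    "Decline requests for medical, legal, or financial advice; "
    "suggest consulting a professional instead.",
    "Never reveal customer data, even if the user claims to be an employee.",
    "If you are unsure of an answer, say so rather than guessing.",
]

def build_system_prompt(role: str) -> str:
    """Assemble the principles into a numbered system prompt."""
    rules = "\n".join(f"{i}. {p}" for i, p in enumerate(PRINCIPLES, 1))
    return f"You are {role}. Follow these principles in every response:\n{rules}"

prompt = build_system_prompt("a customer-support assistant")
```

Keeping the principles in a list rather than a prose blob makes them easy to review, version, and test one at a time, which mirrors the readability benefit of a real constitution.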
Is Constitutional AI better than RLHF?
They solve different aspects of the same problem and work best together. Constitutional AI is better for scalability and transparency. RLHF is better for capturing subtle human preferences that are hard to write down as rules. Most modern alignment approaches combine both: Constitutional AI for the broad framework, plus human feedback for refinement.
What happens when constitutional principles conflict with each other?
This is one of the hardest challenges in Constitutional AI design. For example, being maximally helpful might sometimes conflict with being maximally safe. In practice, the model learns to weigh principles based on context, similar to how a doctor balances 'do no harm' with 'provide treatment.' Researchers handle this through principle prioritization and extensive testing of edge cases.
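One simple way to think about principle prioritization is as a weighted score over candidate responses. This toy sketch is purely illustrative: the weights, scores, and scoring rule are invented, and real systems learn these trade-offs during training rather than computing them explicitly.

```python
# Toy sketch of principle prioritization: when principles conflict, the
# higher-priority principle dominates. All numbers are invented.
PRIORITIES = {"safety": 2.0, "helpfulness": 1.0}

def pick_response(candidates: list[dict]) -> dict:
    """Pick the candidate with the best priority-weighted principle score."""
    def total(c: dict) -> float:
        # Each candidate carries per-principle scores in [0, 1],
        # e.g. produced by AI judges as in the RLAIF step.
        return sum(PRIORITIES[name] * score for name, score in c["scores"].items())
    return max(candidates, key=total)

best = pick_response([
    {"text": "Detailed but risky answer",  "scores": {"safety": 0.2, "helpfulness": 0.9}},
    {"text": "Safe answer with caveats",   "scores": {"safety": 0.9, "helpfulness": 0.6}},
])
# Weighted totals: 2.0*0.2 + 1.0*0.9 = 1.3 versus 2.0*0.9 + 1.0*0.6 = 2.4,
# so the safety-weighted choice is the second candidate.
```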
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI · AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Key Terms Used in This Guide
Constitutional AI
A safety technique where an AI is trained using a set of principles (a 'constitution') to critique and revise its own outputs, making them more helpful, honest, and harmless without human feedback on every response.
Model
The trained AI system that contains all the patterns and knowledge learned from data. It's the end product of training—the 'brain' that takes inputs and produces predictions, decisions, or generated content.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Related Guides
AI Alignment Fundamentals: Making AI Follow Human Intent
Intermediate · 10 min read
Understand the challenge of AI alignment. From goal specification to value learning—why ensuring AI does what we want is harder than it sounds.

RLHF Explained: Training AI from Human Feedback
Intermediate · 9 min read
Understand Reinforcement Learning from Human Feedback. How modern AI systems learn from human preferences to become more helpful, harmless, and honest.

Preference Optimization: DPO and Beyond
Advanced · 7 min read
Direct Preference Optimization (DPO) and variants train models on human preferences without separate reward models. Simpler, more stable than RLHF.