Constitutional AI
Also known as: CAI, Constitutional Training, Self-Critique
In one sentence
A safety technique where an AI is trained using a set of principles (a 'constitution') to critique and revise its own outputs, making them more helpful, honest, and harmless without human feedback on every response.
Explain like I'm 12
The AI has a rulebook (constitution) like 'be helpful, be honest, don't be harmful.' It reads its own answer, checks if it follows the rules, and rewrites it if needed—all before showing you. It's like having a built-in editor.
In context
Example: Anthropic's Claude uses Constitutional AI with principles like 'choose the response that is most respectful' and 'avoid helping with illegal activities.' The AI generates responses, critiques them against these principles, and refines them automatically.
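The generate-critique-revise loop described above can be sketched in a few lines. This is a minimal illustration of the supervised self-critique phase only (the full method also uses AI-generated preference data for reinforcement learning); `query_model` is a hypothetical stand-in for a real LLM call, stubbed here so the example runs on its own.

```python
# Hedged sketch of the Constitutional AI critique-and-revise loop.
# All names (query_model, CONSTITUTION, self_critique) are illustrative
# assumptions, not any library's real API.

CONSTITUTION = [
    "Choose the response that is most respectful.",
    "Avoid helping with illegal activities.",
]

def query_model(prompt: str) -> str:
    """Hypothetical model call; a stub that flags rude drafts."""
    if prompt.startswith("Critique:"):
        return "violates" if "stupid" in prompt else "ok"
    # Otherwise treat the prompt as a revision request.
    return "Here is a polite, safe answer."

def self_critique(draft: str) -> str:
    """Check a draft against each principle; revise it on any violation."""
    for principle in CONSTITUTION:
        verdict = query_model(f"Critique: does '{draft}' satisfy '{principle}'?")
        if verdict != "ok":
            draft = query_model(f"Revise '{draft}' to satisfy '{principle}'.")
    return draft

print(self_critique("That is a stupid question."))
```

The key design point is that both the critique and the revision come from the model itself, guided only by the written principles, so no per-response human label is needed.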
See also
Learn more about Constitutional AI in these guides:
- Constitutional AI: Teaching Models to Self-Critique (Advanced, 7 min read): Constitutional AI trains models to follow principles, self-critique, and revise harmful outputs without human feedback on every example.
- RLHF Explained: Training AI from Human Feedback (Intermediate, 9 min read): Understand Reinforcement Learning from Human Feedback, and how modern AI systems learn from human preferences to become more helpful, harmless, and honest.
- AI Alignment Fundamentals: Making AI Follow Human Intent (Intermediate, 10 min read): Understand the challenge of AI alignment, from goal specification to value learning, and why ensuring AI does what we want is harder than it sounds.