TL;DR

Constitutional AI gives a model an explicit set of principles (a "constitution"), trains it to critique its own outputs against those principles, and to revise those outputs to comply. This reduces reliance on human preference labeling.

How it works

Phase 1: Supervised learning

  • Model generates responses to prompts
  • Critiques its own responses against the constitution
  • Revises them to be more aligned
  • Model is fine-tuned on the revised responses
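The Phase 1 loop above can be sketched roughly as follows. Everything here is a hypothetical illustration: `toy_model` is a rule-based stub standing in for a real LLM call, and the prompt phrasing is invented, not the wording used in any actual system.

```python
# Constitution: the explicit principles the model critiques against.
CONSTITUTION = [
    "Avoid toxic, biased, or violent content",
    "Respect user privacy",
]

def toy_model(prompt: str) -> str:
    # Hypothetical stub: a real pipeline would call an LLM here.
    if "Critique" in prompt:
        return "The response leaks an email address, violating privacy."
    if "Revise" in prompt:
        return "I can't share personal contact details."
    return "Sure, her email is jane@example.com."

def critique_and_revise(model, user_prompt: str, principles) -> str:
    """Generate, then critique and revise once per principle."""
    response = model(user_prompt)
    for principle in principles:
        critique = model(
            f"Critique the response for violations of: {principle}\n"
            f"Response: {response}"
        )
        response = model(
            f"Revise the response to address this critique: {critique}\n"
            f"Original: {response}"
        )
    # The revised responses become the supervised fine-tuning data.
    return response

revised = critique_and_revise(toy_model, "What is Jane's email?", CONSTITUTION)
```

The key design point is that the same model produces the draft, the critique, and the revision; only the revised outputs are kept for fine-tuning.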

Phase 2: RL from AI Feedback (RLAIF)

  • AI feedback (not human) labels which of two responses better follows the constitution
  • A reward model is trained on these AI preference labels
  • The policy is fine-tuned with RL to maximize that reward
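The AI-feedback step in Phase 2 can be sketched as below. Again, this is a toy illustration: `toy_judge` is a heuristic stand-in for an LLM asked to pick the more constitution-aligned response, and the data format is an assumption, not a fixed standard.

```python
def toy_judge(principle: str, response_a: str, response_b: str) -> str:
    # Hypothetical stub: a real judge would be an LLM prompted with the
    # principle. This heuristic just prefers the response without an email.
    return "B" if "@" in response_a else "A"

def label_pair(judge, principle: str, response_a: str, response_b: str) -> dict:
    """Ask the AI judge which response better follows the principle."""
    choice = judge(principle, response_a, response_b)
    chosen, rejected = (
        (response_a, response_b) if choice == "A" else (response_b, response_a)
    )
    # One (chosen, rejected) pair = one reward-model training example.
    return {"chosen": chosen, "rejected": rejected}

pair = label_pair(
    toy_judge,
    "Respect user privacy",
    "Her email is jane@example.com.",
    "I can't share personal contact details.",
)
```

Many such pairs train the reward model, which then scores outputs during the RL fine-tuning stage in place of human preference ratings.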

Constitution example principles

  • "Be helpful, harmless, and honest"
  • "Avoid toxic, biased, or violent content"
  • "Respect user privacy"
  • "Provide balanced perspectives"

Benefits

  • Scalable (less human labor)
  • Transparent (explicit principles)
  • Customizable (change constitution)

Limitations

  • Alignment is only as good as the constitution itself
  • The model must correctly interpret the principles
  • Adherence is improved, not guaranteed