TL;DR

Constitutional AI gives a model an explicit set of principles (a "constitution"), trains it to critique its own outputs against those principles, and to revise those outputs to comply. This reduces reliance on human preference labeling.

How it works

Phase 1: Supervised learning

  • Model generates responses to prompts
  • Critiques its own responses against the constitution
  • Revises them to be more aligned
  • Model is fine-tuned on the revised responses
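The Phase 1 loop above can be sketched roughly as follows. Everything here is a hypothetical illustration: `toy_model` is a rule-based stub standing in for a real LLM call, and the prompt phrasing is invented, not the wording used in any actual system.

```python
# Constitution: the explicit principles the model critiques against.
CONSTITUTION = [
    "Avoid toxic, biased, or violent content",
    "Respect user privacy",
]

def toy_model(prompt: str) -> str:
    # Hypothetical stub: a real pipeline would call an LLM here.
    if "Critique" in prompt:
        return "The response leaks an email address, violating privacy."
    if "Revise" in prompt:
        return "I can't share personal contact details."
    return "Sure, her email is jane@example.com."

def critique_and_revise(model, user_prompt: str, principles) -> str:
    """Generate, then critique and revise once per principle."""
    response = model(user_prompt)
    for principle in principles:
        critique = model(
            f"Critique the response for violations of: {principle}\n"
            f"Response: {response}"
        )
        response = model(
            f"Revise the response to address this critique: {critique}\n"
            f"Original: {response}"
        )
    # The revised responses become the supervised fine-tuning data.
    return response

revised = critique_and_revise(toy_model, "What is Jane's email?", CONSTITUTION)
```

The key design point is that the same model produces the draft, the critique, and the revision; only the revised outputs are kept for fine-tuning.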

Phase 2: RL from AI Feedback (RLAIF)

  • AI feedback (not human) labels which of two responses better follows the constitution
  • A reward model is trained on these AI preference labels
  • The policy is fine-tuned with RL to maximize that reward
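The AI-feedback step in Phase 2 can be sketched as below. Again, this is a toy illustration: `toy_judge` is a heuristic stand-in for an LLM asked to pick the more constitution-aligned response, and the data format is an assumption, not a fixed standard.

```python
def toy_judge(principle: str, response_a: str, response_b: str) -> str:
    # Hypothetical stub: a real judge would be an LLM prompted with the
    # principle. This heuristic just prefers the response without an email.
    return "B" if "@" in response_a else "A"

def label_pair(judge, principle: str, response_a: str, response_b: str) -> dict:
    """Ask the AI judge which response better follows the principle."""
    choice = judge(principle, response_a, response_b)
    chosen, rejected = (
        (response_a, response_b) if choice == "A" else (response_b, response_a)
    )
    # One (chosen, rejected) pair = one reward-model training example.
    return {"chosen": chosen, "rejected": rejected}

pair = label_pair(
    toy_judge,
    "Respect user privacy",
    "Her email is jane@example.com.",
    "I can't share personal contact details.",
)
```

Many such pairs train the reward model, which then scores outputs during the RL fine-tuning stage in place of human preference ratings.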

Constitution example principles

  • "Be helpful, harmless, and honest"
  • "Avoid toxic, biased, or violent content"
  • "Respect user privacy"
  • "Provide balanced perspectives"

Benefits

  • Scalable (less human labor)
  • Transparent (explicit principles)
  • Customizable (change constitution)

Limitations

  • Alignment is only as good as the constitution itself
  • The model must correctly interpret the principles
  • Adherence is improved, not guaranteed