RLHF (Reinforcement Learning from Human Feedback)
Also known as: Reinforcement Learning from Human Feedback, RLHF, Human Feedback Training
In one sentence
A training method where humans rate AI outputs to teach the model which responses are helpful, harmless, and accurate.
Explain like I'm 12
Imagine training a dog by saying 'good boy' or 'bad boy' after each trick. RLHF works the same way — humans tell the AI which answers are good and which are bad, so it learns what people actually want.
In context
RLHF is the secret sauce behind ChatGPT, Claude, and other conversational AI assistants. After an LLM is pre-trained on large amounts of text, human raters compare pairs of its responses and pick the better one. These preferences are used to train a 'reward model' that scores responses, and the LLM is then fine-tuned (typically with a reinforcement learning algorithm such as PPO) to maximise that reward score. This process is why modern assistants feel helpful and safe rather than just autocompleting text.
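The preference-to-reward step above can be sketched in a few lines. Reward models are commonly fitted with a Bradley-Terry style loss, -log(sigmoid(r_chosen - r_rejected)); everything else in this sketch (the word-count "reward model", the helpful-word list, the example responses) is an invented toy stand-in, not how any real system scores text:

```python
import math
import string

# Toy reward model: score a response by counting "helpful" marker words.
# This word list and scoring rule are invented purely for illustration;
# real reward models are neural networks trained on many human comparisons.
HELPFUL_WORDS = {"sure", "here", "steps", "example"}

def reward(response: str) -> float:
    words = (w.strip(string.punctuation) for w in response.lower().split())
    return float(sum(1 for w in words if w in HELPFUL_WORDS))

def preference_loss(chosen: str, rejected: str) -> float:
    """Bradley-Terry preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model already ranks the human-preferred response
    higher; large when the ranking is wrong."""
    diff = reward(chosen) - reward(rejected)
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# A human rater preferred the first response over the second.
chosen = "Sure, here are the steps with an example."
rejected = "idk"
loss = preference_loss(chosen, rejected)
print(f"loss = {loss:.4f}")  # small: the toy reward agrees with the rater
```

Minimising this loss over many human comparisons is what teaches the reward model which responses people prefer; the LLM is then fine-tuned against that learned score.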
Related Guides
Learn more about RLHF (Reinforcement Learning from Human Feedback) in these guides:
RLHF Explained: Training AI from Human Feedback (Intermediate, 9 min read)
Understand Reinforcement Learning from Human Feedback. How modern AI systems learn from human preferences to become more helpful, harmless, and honest.

AI Safety and Alignment: Building Helpful, Harmless AI (Intermediate, 9 min read)
AI alignment ensures models do what we want them to do safely. Learn about RLHF, safety techniques, and responsible deployment.

Preference Optimization: DPO and Beyond (Advanced, 7 min read)
Direct Preference Optimization (DPO) and variants train models on human preferences without separate reward models. Simpler and more stable than RLHF.