RLHF (Reinforcement Learning from Human Feedback)
Also known as: Reinforcement Learning from Human Feedback, RLHF, Human Feedback Training
In one sentence
A training method where humans rate AI outputs to teach the model which responses are helpful, harmless, and accurate.
Explain like I'm 12
Training AI by having real people give it grades on its answers—like a teacher marking homework—so it learns what good responses look like.
In context
Used to train ChatGPT, Claude, and other assistants to be more helpful and safe. Human raters compare pairs of model outputs and pick the better one; the model then learns to favor the kinds of responses people preferred.
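Mechanically, those comparisons are usually turned into a reward model: a scorer trained so that the human-preferred response gets a higher score than the rejected one, after which the assistant is fine-tuned (often with PPO) to produce high-scoring responses. Below is a minimal PyTorch sketch of that first, reward-modeling step only; TinyRewardModel and the random feature vectors are hypothetical stand-ins for a real language model and real encoded responses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the reward-modeling stage of RLHF (toy data, assumed shapes).
# The reward model scores a response with a single number; it is trained so that
# the response a human preferred scores higher than the one they rejected.

class TinyRewardModel(nn.Module):
    """Hypothetical stand-in for a language-model backbone plus a scalar reward head."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.backbone = nn.Linear(feature_dim, 32)  # placeholder for a transformer encoder
        self.reward_head = nn.Linear(32, 1)         # maps features to one scalar reward

    def forward(self, response_features: torch.Tensor) -> torch.Tensor:
        hidden = torch.relu(self.backbone(response_features))
        return self.reward_head(hidden).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy batch: random vectors stand in for encoded "chosen" and "rejected" responses.
model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

chosen_features = torch.randn(8, 16)
rejected_features = torch.randn(8, 16)

optimizer.zero_grad()
loss = preference_loss(model(chosen_features), model(rejected_features))
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```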
Related Guides
Learn more about RLHF (Reinforcement Learning from Human Feedback) in these guides:
AI Safety and Alignment: Building Helpful, Harmless AI
Intermediate · 7 min read
AI alignment ensures models do what we want them to do safely. Learn about RLHF, safety techniques, and responsible deployment.

Preference Optimization: DPO and Beyond
Advanced · 7 min read
Direct Preference Optimization (DPO) and variants train models on human preferences without separate reward models. Simpler, more stable than RLHF.