RLHF (Reinforcement Learning from Human Feedback)
Also known as: Reinforcement Learning from Human Feedback, RLHF, Human Feedback Training
In one sentence
A training method where humans rate AI outputs to teach the model which responses are helpful, harmless, and accurate.
Explain like I'm 12
Imagine training a dog by saying 'good boy' or 'bad boy' after each trick. RLHF works the same way — humans tell the AI which answers are good and which are bad, so it learns what people actually want.
In context
RLHF is the secret sauce behind ChatGPT, Claude, and other conversational AI assistants. After an LLM is pre-trained on large amounts of text, human raters compare pairs of its responses and pick the better one. These preferences are used to train a 'reward model' that scores responses, and the LLM is then fine-tuned (typically with a reinforcement learning algorithm such as PPO) to maximise that reward score. This process is why modern assistants feel helpful and safe rather than just autocompleting text.
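The preference-to-reward step above can be sketched in a few lines. Reward models are commonly fitted with a Bradley-Terry style loss, -log(sigmoid(r_chosen - r_rejected)); everything else in this sketch (the word-count "reward model", the helpful-word list, the example responses) is an invented toy stand-in, not how any real system scores text:

```python
import math
import string

# Toy reward model: score a response by counting "helpful" marker words.
# This word list and scoring rule are invented purely for illustration;
# real reward models are neural networks trained on many human comparisons.
HELPFUL_WORDS = {"sure", "here", "steps", "example"}

def reward(response: str) -> float:
    words = (w.strip(string.punctuation) for w in response.lower().split())
    return float(sum(1 for w in words if w in HELPFUL_WORDS))

def preference_loss(chosen: str, rejected: str) -> float:
    """Bradley-Terry preference loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model already ranks the human-preferred response
    higher; large when the ranking is wrong."""
    diff = reward(chosen) - reward(rejected)
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# A human rater preferred the first response over the second.
chosen = "Sure, here are the steps with an example."
rejected = "idk"
loss = preference_loss(chosen, rejected)
print(f"loss = {loss:.4f}")  # small: the toy reward agrees with the rater
```

Minimising this loss over many human comparisons is what teaches the reward model which responses people prefer; the LLM is then fine-tuned against that learned score.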
Related Guides
Learn more about RLHF (Reinforcement Learning from Human Feedback) in these guides:
RLHF Explained: Training AI from Human Feedback (Intermediate, 9 min read)
Understand Reinforcement Learning from Human Feedback. How modern AI systems learn from human preferences to become more helpful, harmless, and honest.

AI Safety and Alignment: Building Helpful, Harmless AI (Intermediate, 9 min read)
AI alignment ensures models do what we want them to do safely. Learn about RLHF, safety techniques, and responsible deployment.

Preference Optimization: DPO and Beyond (Advanced, 7 min read)
Direct Preference Optimization (DPO) and variants train models on human preferences without separate reward models. Simpler and more stable than RLHF.