
RLHF (Reinforcement Learning from Human Feedback)

Also known as: Reinforcement Learning from Human Feedback, RLHF, Human Feedback Training

In one sentence

A training method where humans rate AI outputs to teach the model which responses are helpful, harmless, and accurate.

Explain like I'm 12

Imagine training a dog by saying 'good boy' after a trick done well and withholding the treat after a bad one. RLHF works the same way — humans tell the AI which answers are good and which are bad, so it learns what people actually want.

In context

RLHF is the secret sauce behind ChatGPT, Claude, and other conversational AI assistants. After an LLM is pre-trained on text data, human raters compare pairs of responses and pick the better one. These preferences train a 'reward model' that scores responses, and the LLM is then fine-tuned to maximise that reward score. This process is why modern assistants feel helpful and safe rather than just autocompleting text.
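The reward-model step above can be sketched in a few lines. This is a toy illustration, not a real implementation: an actual reward model is a neural network scoring full text, whereas here a hypothetical linear scorer over made-up numeric features stands in for it. The loss shown is the standard Bradley–Terry pairwise preference loss: minimise the negative log-probability that the human-chosen response outscores the rejected one.

```python
import math

def reward(weights, features):
    """Toy reward model: a linear score over response features.
    (In real RLHF this is a neural network over the full text.)"""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, chosen, rejected):
    """Bradley-Terry loss: -log P(chosen is preferred over rejected),
    where P is the sigmoid of the score margin between the two."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical example: weights and features are invented for illustration.
w = [1.0, 0.5]
chosen_features = [2.0, 1.0]    # response the human rater picked
rejected_features = [1.0, 0.0]  # response the human rater passed over
loss = preference_loss(w, chosen_features, rejected_features)
```

Training drives this loss down across many rated pairs, which widens the score margin in favour of preferred responses; the fine-tuning stage then optimises the LLM to produce outputs that this learned scorer rates highly.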
