RLHF: tuning a model on human preference

RLHF (Reinforcement Learning from Human Feedback) is a training stage that adjusts a model using human judgements of its outputs. People rank or rate candidate answers, those preferences train a reward signal, and the model is then optimised to produce answers the signal scores highly. It is how a fluent base model is shaped into one that follows instructions and behaves the way its makers intend.

At a glance

What it is: A training stage that tunes a model on human preference judgements
Stands for: Reinforcement Learning from Human Feedback
Why it matters: Turns a fluent base model into one that follows instructions and behaves the way its makers intend
Where it sits: After pretraining, as part of the alignment and instruction-tuning step

What problem does RLHF solve?

A pretrained model learns to predict likely next text from a vast corpus. That makes it fluent, but fluent is not the same as helpful. Ask such a model a question and it may continue the text plausibly rather than answer what you meant. There is no built-in sense of which of two valid-looking answers a person would actually prefer.

RLHF adds that sense. People look at candidate outputs and judge which is better. Those judgements train a reward model, whose score serves as a stand-in for human preference, and the main model is then optimised to produce answers that score highly. The result is a model that follows instructions and behaves more like the assistant its makers had in mind.
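The two steps just described can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not how any production system is built: the two features, the preference pairs, the three candidate answers, and every function name are hypothetical. The reward is a simple weighted sum fitted with a pairwise logistic (Bradley-Terry style) loss, and the "model" is just a probability distribution over three fixed answers, nudged toward higher reward by a basic policy-gradient update.

```python
import math

# Hypothetical toy data. Each answer is reduced to two made-up features:
# (answers_the_question, is_polite). Real systems score text with a neural network.
PREFERENCES = [
    # (features of the answer raters preferred, features of the one they rejected)
    ((1.0, 1.0), (0.0, 1.0)),
    ((1.0, 0.0), (0.0, 0.0)),
    ((1.0, 1.0), (1.0, 0.0)),
    ((0.0, 1.0), (0.0, 0.0)),
]

# Three fixed candidate answers the toy policy can choose between.
CANDIDATES = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def reward(weights, features):
    """Linear reward: a weighted sum of the answer's features."""
    return sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_model(pairs, steps=500, lr=0.5):
    """Step 1: fit weights so preferred answers score above rejected ones.

    Minimises -log(sigmoid(r_chosen - r_rejected)) by gradient descent.
    """
    weights = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            scale = 1.0 - sigmoid(margin)  # larger when the pair is mis-ranked
            for i in range(len(weights)):
                grad[i] -= scale * (chosen[i] - rejected[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def optimise_policy(rewards, steps=200, lr=1.0):
    """Step 2: shift probability toward answers the reward scores highly.

    Uses the exact gradient of expected reward for a softmax policy:
    d E[r] / d logit_k = p_k * (r_k - E[r]).
    """
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [l + lr * p * (r - expected)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)


weights = train_reward_model(PREFERENCES)
rewards = [reward(weights, c) for c in CANDIDATES]
probs = optimise_policy(rewards)

print("learned reward weights:", weights)
print("policy over candidates:", probs)
```

The sketch leaves out a great deal. In particular, practical setups typically add a penalty that keeps the tuned model close to its starting point, so it cannot drift arbitrarily far in pursuit of reward. Note also that the learned weights reflect nothing except the four hypothetical preference pairs: change who rated, or what they preferred, and the resulting behaviour changes with it.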

What does that mean for you as an operator?

Two things are worth keeping in front of you. First, the behaviour you see is shaped by whoever did the rating. A model’s refusals, its tone, and its idea of a good answer all carry the preferences baked in during this stage. That is a design choice, not a law of nature.

Second, RLHF tunes behaviour, not facts. It can make a model more polite, more on-task, and better at declining what it should decline. It does not give the model a reliable grip on truth, so it does not, on its own, stop the model from stating something confident and wrong. Preference tuning and factual grounding are different jobs.

Pretraining alone

  • Predicts likely next text from a huge corpus
  • Fluent, but not steered toward being helpful
  • May continue the prompt as text rather than answer the intended question
  • No notion of which of two answers a person prefers

After RLHF

  • Tuned toward answers people rated as better
  • Follows instructions and stays in the intended lane
  • Tries to give the answer you actually wanted
  • Carries the preferences of whoever did the rating

Reviewed: June 2026