RLHF: tuning a model on human preference

RLHF (Reinforcement Learning from Human Feedback) is a training stage that adjusts a model using human judgements of its outputs. People rank or rate candidate answers, those preferences train a reward signal, and the model is then optimised to produce answers the signal scores highly. It is how a fluent base model is shaped into one that follows instructions and behaves the way its makers intend.

What problem does RLHF solve?

A pretrained model learns to predict likely next text from a vast corpus. That makes it fluent, but fluent is not the same as helpful. Ask such a model a question and it may continue the text plausibly rather than answer what you meant. There is no built-in sense of which of two valid-looking answers a person would actually prefer.

RLHF adds that sense. People look at candidate outputs and judge which is better. Those judgements train a reward signal, a stand-in for human preference, and the model is then optimised to produce answers the signal rates highly. The result is a model that follows instructions and behaves more like the assistant its makers had in mind.
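To make the "judgements train a reward signal" step concrete, here is a minimal sketch in Python. It assumes a pairwise (Bradley-Terry style) loss, which is one common way to fit a reward signal but is not something this page specifies. The feature vectors, the linear `reward` function, and the toy `pairs` data are all hypothetical stand-ins for a real reward model and real rater data.

```python
import math

# Hypothetical toy data: each answer is reduced to two made-up features
# (politeness, on_topic). Each pair is (preferred, rejected) as judged by a rater.
pairs = [
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 0.0]),
]


def reward(w, x):
    """Linear score for an answer; a stand-in for a learned reward model."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pairwise_loss(w, pairs):
    """Mean of -log sigmoid(reward(preferred) - reward(rejected))."""
    total = 0.0
    for good, bad in pairs:
        margin = reward(w, good) - reward(w, bad)
        total += math.log(1.0 + math.exp(-margin))
    return total / len(pairs)


def train(pairs, steps=200, lr=0.5):
    """Plain gradient descent on the pairwise loss (assumed settings)."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for good, bad in pairs:
            margin = reward(w, good) - reward(w, bad)
            # Derivative of log(1 + exp(-margin)) with respect to margin.
            coeff = -1.0 / (1.0 + math.exp(margin))
            for i in range(len(w)):
                grad[i] += coeff * (good[i] - bad[i]) / len(pairs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


w = train(pairs)
for good, bad in pairs:
    # After training, each preferred answer should outscore its rejected one.
    print(reward(w, good) > reward(w, bad))
```

The sketch covers only the first half of the process: turning ranked pairs into a scoring function. The second half, optimising the model itself against that score, is a separate reinforcement learning step and is not shown here.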

What does that mean for you as an operator?

Two things are worth keeping in mind. First, the behaviour you see is shaped by whoever did the rating. A model’s refusals, its tone, and its idea of a good answer all carry the preferences baked in during this stage. That is a design choice, not a law of nature.

Second, RLHF tunes behaviour, not facts. It can make a model more polite, more on-task, and better at declining what it should decline. It does not give the model a reliable grip on truth, so it does not, on its own, stop the model from stating something confident and wrong. Preference tuning and factual grounding are different jobs.
