Paper Summary: Deep Reinforcement Learning from Human Preferences

Last updated: 20 Jul 2025

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Deep Reinforcement Learning from Human Preferences Source

WHAT

An algorithm to estimate a reward function using human opinions. The function is then optimized in a Reinforcement Learning (RL) setting.

This approach is now called RLHF (Reinforcement Learning from Human Feedback).

WHY

Because for some types of RL problems it isn't practical to formulate a reward function mathematically, whereas it is possible to ask humans which of two behaviors they subjectively prefer.

HOW

1) Show humans pairs of short trajectory segments (brief clips of the agent's behavior, i.e. short sequences of states) and ask them which one is preferable;
2) Learn a reward function in a supervised manner using the data from step 1;
3) Train an RL model using the learned reward function as a proxy for the real reward.
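
To make steps 1 and 2 concrete for myself, here is a minimal sketch, not the authors' implementation. It assumes a toy setting: each segment is a short list of feature vectors, the reward model is linear (the paper uses deep networks), and a scripted labeller stands in for the human. All names (`fit_reward`, `TRUE_WEIGHTS`, `make_segment`, and so on) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def segment_return(weights, segment):
    """Sum of linear rewards w . x over all steps of a segment."""
    return sum(sum(w * x for w, x in zip(weights, step)) for step in segment)


def fit_reward(comparisons, n_features, lr=0.05, epochs=100):
    """Step 2: fit reward weights by SGD on a pairwise cross-entropy loss."""
    # The reward is linear, so only the difference of summed features matters.
    data = []
    for seg_a, seg_b, a_preferred in comparisons:
        diff = [
            sum(step[i] for step in seg_a) - sum(step[i] for step in seg_b)
            for i in range(n_features)
        ]
        data.append((diff, 1.0 if a_preferred else 0.0))

    weights = [0.0] * n_features
    for _ in range(epochs):
        for diff, label in data:
            # Predicted probability that segment A is preferred over B.
            p = sigmoid(sum(w * d for w, d in zip(weights, diff)))
            for i in range(n_features):
                weights[i] += lr * (label - p) * diff[i]
    return weights


def make_segment(rng, length=5, n_features=3):
    """Toy segment: `length` steps, each a random feature vector."""
    return [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(length)]


rng = random.Random(0)
TRUE_WEIGHTS = [1.0, -2.0, 0.5]  # hidden "real" reward, unknown to the learner

# Step 1: collect pairwise comparisons (scripted labeller instead of a human).
comparisons = []
for _ in range(200):
    a, b = make_segment(rng), make_segment(rng)
    a_preferred = segment_return(TRUE_WEIGHTS, a) > segment_return(TRUE_WEIGHTS, b)
    comparisons.append((a, b, a_preferred))

learned_weights = fit_reward(comparisons, n_features=3)

# Step 3 would pass the learned reward to an RL algorithm. Here I only check
# how often the learned reward ranks fresh pairs the same way as the true one.
test_pairs = [(make_segment(rng), make_segment(rng)) for _ in range(200)]
agreement = sum(
    (segment_return(learned_weights, a) > segment_return(learned_weights, b))
    == (segment_return(TRUE_WEIGHTS, a) > segment_return(TRUE_WEIGHTS, b))
    for a, b in test_pairs
) / len(test_pairs)
print(f"ranking agreement on held-out pairs: {agreement:.2f}")
```

The point of the sketch is that the learner never sees `TRUE_WEIGHTS`, only which segment of each pair was preferred, yet it can still recover a reward that ranks new segments similarly.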

CLAIMS/QUOTES

It is possible to use a learned reward function built from human preferences.
In some cases, an agent trained on the learned reward function performs better than one trained on the actual hand-written reward function.
Comparisons are easier for humans: "We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences"

EXTENDS/USES

OpenAI Gym

NOTES

Performance is evaluated on a set of robotics and video-game-playing RL tasks.
In addition to human feedback, the authors also used so-called synthetic feedback: preference labels generated automatically from the task's true reward signal instead of by a human.
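
As I understand it, the synthetic labeller simply prefers whichever segment collected more true reward. A minimal sketch of that idea follows. The function name and the tuple encoding of the label are my own, and treating ties as a 50/50 label is an assumption on my part.

```python
def synthetic_preference(true_rewards_1, true_rewards_2):
    """Return a label (mu_1, mu_2) over the two segments.

    Each argument is the list of true per-step rewards of one segment.
    The segment with the higher total true reward gets all the mass.
    Ties are split evenly (an assumption, see the text above).
    """
    total_1 = sum(true_rewards_1)
    total_2 = sum(true_rewards_2)
    if total_1 > total_2:
        return (1.0, 0.0)
    if total_2 > total_1:
        return (0.0, 1.0)
    return (0.5, 0.5)


# Segment 1 collects more true reward, so it is preferred.
label = synthetic_preference([0.0, 1.0, 1.0], [0.0, 0.0, 1.0])
```

This makes it possible to run many experiments without human labellers and to compare the result against RL on the true reward.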

MY 2¢

The term "RLHF" is not mentioned in the article.
The idea behind RLHF is not introduced in this article either. The authors' contributions revolve around making the process more efficient.
RLHF is relevant for NLP and instruction-tuning because it is not trivial to estimate how appropriate an output is to a given instruction. RLHF can be used to fine-tune a pre-trained LLM.
There is a standard way to turn pairwise preference rankings into a scoring function: the Bradley-Terry model.
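
For reference, this is how I read the Bradley-Terry-style formulation in the paper. The notation is my transcription and should be checked against the source: \sigma^1 and \sigma^2 are the two segments, (o_t, a_t) are observations and actions, \hat{r} is the learned reward, \mu is the human label as a distribution over {1, 2}, and \mathcal{D} is the set of collected comparisons.

```latex
% Probability that segment 1 is preferred over segment 2
\hat{P}\left[\sigma^1 \succ \sigma^2\right] =
  \frac{\exp \sum_t \hat{r}(o_t^1, a_t^1)}
       {\exp \sum_t \hat{r}(o_t^1, a_t^1) + \exp \sum_t \hat{r}(o_t^2, a_t^2)}

% Cross-entropy loss between predictions and human labels
\mathrm{loss}(\hat{r}) =
  - \sum_{(\sigma^1, \sigma^2, \mu) \in \mathcal{D}}
    \left( \mu(1) \log \hat{P}\left[\sigma^1 \succ \sigma^2\right]
         + \mu(2) \log \hat{P}\left[\sigma^2 \succ \sigma^1\right] \right)
```

In words: the summed predicted reward of a segment acts as its "strength", a softmax over the two strengths gives the preference probability, and the reward model is trained so that these probabilities match the labels.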

References

arXiv: Christiano et al., 2017: Deep Reinforcement Learning from Human Preferences

Felipe 15 Jul 2023 20 Jul 2025 paper-summary reinforcement-learning rlhf