Paper Summary: Deep Reinforcement Learning from Human Preferences


Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be treated as such.

[Cover image: Deep Reinforcement Learning from Human Preferences]


WHAT

An algorithm to estimate a reward function from human opinions. The estimated function is then optimized in a Reinforcement Learning (RL) setting.

This approach is now called RLHF (Reinforcement Learning from Human Feedback).


WHY

Because for some types of RL problems it isn't practical to formulate a reward function mathematically, but it is possible to ask humans to subjectively rate how preferable a given state is.


HOW

  • 1) Show humans pairs of states and ask them to say which of the two states is preferable;

  • 2) Learn a reward function in a supervised manner using the data from step 1;

  • 3) Train an RL model using the learned reward function as a proxy for the real reward.
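The three steps above can be sketched end-to-end on a toy problem. Everything below is an illustrative assumption (a linear reward model over 4-dimensional state features, with the hidden true reward standing in for the human annotator of step 1), not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: states are 4-dim feature vectors and the hidden
# "true" reward is a linear function of those features.
true_w = np.array([1.0, -2.0, 0.5, 0.0])

def true_reward(s):
    return s @ true_w

# Step 1: collect pairwise preferences. label = 1 if the first state of
# the pair is preferred (here simulated with the true reward, standing
# in for a human annotator).
pairs = rng.normal(size=(500, 2, 4))
labels = (true_reward(pairs[:, 0]) > true_reward(pairs[:, 1])).astype(float)

# Step 2: fit a reward model r(s) = s @ w by gradient descent on the
# Bradley-Terry cross-entropy: P(s1 preferred) = sigmoid(r(s1) - r(s2)).
w = np.zeros(4)
lr = 0.5
for _ in range(200):
    d = pairs[:, 0] @ w - pairs[:, 1] @ w           # r(s1) - r(s2)
    p = 1.0 / (1.0 + np.exp(-d))                    # P(s1 preferred)
    grad = ((p - labels)[:, None] * (pairs[:, 0] - pairs[:, 1])).mean(axis=0)
    w -= lr * grad

# The learned reward should rank states like the true one; step 3 would
# plug it into any RL algorithm as a proxy for the real reward.
test_states = rng.normal(size=(1000, 4))
agreement = np.mean((test_states @ w > 0) == (true_reward(test_states) > 0))
print(round(float(agreement), 2))
```

The key point is that the reward model is trained with ordinary supervised learning on comparison labels, so no absolute reward values ever need to be specified by hand.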


CLAIMS

  • It is possible to train an RL agent using a reward function learned from human preferences.

  • In some cases, the learned reward function performs even better than the actual mathematical reward function.


TOOLS

  • OpenAI Gym


NOTES

  • Performance is evaluated on a set of robotics and video-game-playing RL tasks.

  • In addition to human feedback, the authors also used so-called synthetic feedback: preference labels generated automatically from the environment's true reward signal.
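A minimal sketch of what synthetic feedback amounts to (the trajectory segments here are hypothetical; the label comes from comparing true returns instead of asking a human):

```python
import numpy as np

def synthetic_preference(rewards_a, rewards_b):
    """Return 0 if segment A is preferred, 1 if segment B is,
    judging by the true (summed) reward of each segment."""
    return int(np.sum(rewards_b) > np.sum(rewards_a))

seg_a = np.array([1.0, 0.0, 2.0])   # return = 3.0
seg_b = np.array([0.5, 0.5, 0.5])   # return = 1.5
print(synthetic_preference(seg_a, seg_b))  # -> 0 (segment A preferred)
```

This is useful for ablations: it lets the preference-learning pipeline be evaluated at scale without paying for human annotations.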

MY 2¢

  • The term "RLHF" is not actually mentioned in the paper.

  • Learning from human preferences is not introduced in this paper either; the authors' contributions revolve around making the process efficient enough to scale to deep RL.

  • RLHF is relevant for NLP and instruction tuning because it is not trivial to estimate how appropriate an output is for a given instruction, whereas humans can readily compare two outputs. RLHF can therefore be used to fine-tune a pre-trained LLM.

  • There exists a principled way to turn pairwise preference rankings into a scalar score function: the Bradley-Terry model.
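A minimal sketch of fitting Bradley-Terry strengths from pairwise comparisons (the win matrix is made up, and the fitting procedure is the classic MM/Zermelo iteration, not anything from the paper):

```python
import numpy as np

# Hypothetical win matrix: wins[i, j] = how many times item i was
# preferred over item j in pairwise comparisons.
wins = np.array([
    [0, 8, 9],   # item 0 beats items 1 and 2 most of the time
    [2, 0, 6],
    [1, 4, 0],
])
n = wins + wins.T        # total comparisons per pair
p = np.ones(3)           # latent Bradley-Terry strengths

# Minorization-maximization updates for the Bradley-Terry model, where
# P(i preferred over j) = p_i / (p_i + p_j).
for _ in range(100):
    for i in range(3):
        total_wins = wins[i].sum()
        denom = sum(n[i, j] / (p[i] + p[j]) for j in range(3) if j != i)
        p[i] = total_wins / denom
    p /= p.sum()         # fix the scale: strengths are only relative

ranking = np.argsort(-p)  # most preferred item first
print(ranking)
```

The learned strengths define a scalar score for each item, which is exactly the kind of function a preference-based reward model produces.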

