Paper Summary: Deep Reinforcement Learning from Human Preferences
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
An algorithm to estimate a reward function using human opinions. The function is then optimized in a Reinforcement Learning (RL) setting.
This approach is now commonly called RLHF (Reinforcement Learning from Human Feedback).
WHY
Because it isn't practical to mathematically formulate a reward function for some types of RL problems. But it is possible to ask humans to subjectively judge which of two behaviors is preferable.
HOW
1) Show humans pairs of short trajectory segments (clips of agent behavior) and ask them to indicate which of the two is preferable;
2) Learn a reward function in a supervised manner from the preference data collected in step 1 (see the sketch after this list);
3) Train an RL model using the learned reward function as a proxy for the real reward.
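To make step 2 concrete, here is a minimal PyTorch-style sketch of training a reward model from pairwise preferences with a cross-entropy loss over summed predicted rewards, which is the form used in the paper. The class name `RewardModel`, the network architecture, and the tensor shapes are my own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate (hypothetical architecture)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim)
        x = torch.cat([obs, act], dim=-1)
        return self.net(x).squeeze(-1)  # (batch, T) per-step reward estimates

def preference_loss(reward_model, seg_a, seg_b, prefs):
    """Cross-entropy loss on pairwise preferences (Bradley-Terry style).

    seg_a, seg_b: (obs, act) tensors for the two trajectory segments.
    prefs: (batch,) tensor with 1.0 if segment A was preferred, 0.0 if B was.
    """
    # Sum predicted per-step rewards over each segment.
    ret_a = reward_model(*seg_a).sum(dim=1)   # (batch,)
    ret_b = reward_model(*seg_b).sum(dim=1)   # (batch,)
    # sigmoid(ret_a - ret_b) = exp(ret_a) / (exp(ret_a) + exp(ret_b))
    logits = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)
```

The policy itself is then trained with a standard RL algorithm (the paper uses policy-gradient methods such as TRPO and advantage actor-critic), with the reward model's output substituted for the environment reward.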
CLAIMS
Complex deep RL tasks can be solved with a reward function learned from human preferences instead of a hand-specified one.
In some cases, the learned reward function leads to better performance than the environment's true, mathematically defined reward function.
EXTENDS/USES
- OpenAI Gym
NOTES
Performance is evaluated on simulated robotics (MuJoCo) and Atari game-playing RL tasks.
In addition to human feedback, the authors also use so-called synthetic feedback: preference labels generated automatically from the true underlying reward signal instead of from humans (see the sketch below).
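As I understand it, the synthetic oracle simply prefers whichever segment has the higher true return. A hypothetical sketch (the function name and tie handling are my own):

```python
def synthetic_preference(true_return_a: float, true_return_b: float) -> float:
    """Label a pair of segments from the true reward instead of a human.

    Returns 1.0 if segment A has the higher true return, 0.0 if B does,
    and 0.5 on a tie (the two segments are treated as equally preferable).
    """
    if true_return_a > true_return_b:
        return 1.0
    if true_return_b > true_return_a:
        return 0.0
    return 0.5
```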
MY 2¢
The term "RLHF" is not mentioned in the article.
RLHF is not introduced in this article. The authors' contributions revolve around making the process more efficient.
RLHF is relevant for NLP and instruction tuning because it is not trivial to quantify how appropriate an output is for a given instruction; RLHF can therefore be used to fine-tune a pre-trained LLM from human judgments.
The Bradley-Terry model is a standard way to turn pairwise preference comparisons into a scalar scoring function; the preference predictor in this paper follows that form.
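For reference, the preference probability used in the paper has (roughly) this Bradley-Terry form, where $\hat{r}$ is the learned reward and $\sigma^1, \sigma^2$ are the two compared trajectory segments:

$$
\hat{P}\big[\sigma^1 \succ \sigma^2\big] = \frac{\exp\big(\sum_t \hat{r}(o_t^1, a_t^1)\big)}{\exp\big(\sum_t \hat{r}(o_t^1, a_t^1)\big) + \exp\big(\sum_t \hat{r}(o_t^2, a_t^2)\big)}
$$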