Paper Summary: Fine-Tuning Language Models from Human Preferences
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
RLHF is used to fine-tune pretrained LLMs for specific tasks (stylistic continuation and text summarization).
WHY
Because RL/RLHF had not previously been used to fine-tune LLMs, and the authors thought it could lead to good results.
HOW
Traditional RLHF flow using 4-wise preference data:
- 1) Collect preference data
- 2) Train a reward model on the data from Step 1 (a minimal sketch of its loss follows this list)
- 3) Run an RL loop (using PPO) to optimize the base LLM to maximize the reward given by the model from Step 2
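A minimal sketch of the Step-2 loss as I understand it from the paper: the reward model scores each of the four candidate completions for a context, and is trained with a cross-entropy (softmax) loss against the labeler's pick. The tensor shapes and the `score_candidates` helper below are my own assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(rewards: torch.Tensor, chosen: torch.Tensor) -> torch.Tensor:
    """4-wise preference loss: softmax over the four candidate scores r(x, y_i),
    with cross-entropy against the index of the candidate the labeler preferred.

    rewards: (batch, 4) scores for the four candidate completions of each context
    chosen:  (batch,)   index in [0, 4) of the preferred candidate
    """
    return F.cross_entropy(rewards, chosen)

# Hypothetical usage (score_candidates would run the reward model over all four
# candidates for each context):
# rewards = score_candidates(reward_model, contexts, candidates)  # (batch, 4)
# loss = reward_model_loss(rewards, chosen)
```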
CLAIMS/QUOTES
SFT+RL is better: "On both datasets, supervised + RL fine-tuning is best, and indeed pure RL finetuning is worse than the supervised baseline according to ROUGE in all cases ..."
On miscommunication and incentives for human labelers: "This reveals a mismatch between the notion of quality we wanted our model to learn, and what the human labelers actually evaluated"
Ambiguous tasks make labeling hard: Subjective judgments about what makes a summary good caused a lot of variance in the labels: "When possible, it seems better to design less ambiguous labeling tasks that get at the same information."
Mixed results depending on the task: On the stylistic continuation tasks the RLHF'd models perform very well, but on the summarization task the RLHF'd model learned to just select (copy) sentences from the original text.
EXTENDS/USES
- GPT-2 by Radford et al., 2019 (summary)
- PPO (Proximal Policy Optimization) by Schulman et al., 2017 (summary)
NOTES
SFT beforehand: The authors mention that supervised fine-tuning (SFT) was applied before the RL loop for the summarization tasks.
KL-divergence penalty: The KL-divergence penalty, now commonly used to prevent reward hacking, was apparently introduced here, although the authors say it was used to "encourage coherence and topicality".
KL-divergence hyperparameter: The authors weight the KL penalty with a \(\beta\) coefficient and vary it empirically for each problem to find good values.
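In the paper's notation, the reward actually maximized by PPO is the learned reward minus the weighted KL term between the fine-tuned policy \(\pi\) and the original model \(\rho\):

\[
R(x, y) = r(x, y) - \beta \, \log \frac{\pi(y \mid x)}{\rho(y \mid x)}
\]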
RL learned to copy: In the summarization task, the RL-only models learned to just copy sentences from the source text as a way of summarizing it (a form of reward hacking for summarization).