Paper Summary: Fine-Tuning Language Models from Human Preferences

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Cover image: Fine-Tuning Language Models from Human Preferences (Source)

WHAT

RLHF is used to fine-tune pretrained LLMs for specific tasks (textual style transfer and text summarization).

WHY

Because RL/RLHF had not previously been used to fine-tune LLMs, and the authors thought it could lead to good results.

HOW

Traditional RLHF flow using 4-way preference data (human labelers pick the best of four sampled completions)

  • 1) Collect 4-way preference data from human labelers
  • 2) Train a reward model on the data from Step 1 (a minimal sketch of this loss follows the list)
  • 3) RL loop to optimize the base LLM (using PPO) to maximize the reward from the reward model from Step 2
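
As a rough illustration of Step 2 (not the authors' code), the reward model is trained so that a softmax over its scores for the four candidate completions matches the labeler's choice. The sketch below assumes PyTorch and takes the per-candidate scores as already computed by a hypothetical reward model:

```python
import torch
import torch.nn.functional as F

def preference_loss(candidate_rewards: torch.Tensor, preferred: torch.Tensor) -> torch.Tensor:
    """4-way preference loss for reward-model training.

    candidate_rewards: (batch, 4) scalar reward-model scores for the four sampled completions.
    preferred: (batch,) index of the completion the human labeler picked.
    """
    # Softmax over the four candidates turns the scores into a choice distribution;
    # cross-entropy against the labeler's pick is the negative log-likelihood of
    # the human choice, which training minimizes.
    return F.cross_entropy(candidate_rewards, preferred)

# Dummy usage: scores for one context with four candidate completions.
scores = torch.tensor([[0.2, 1.5, -0.3, 0.0]])  # outputs of a (hypothetical) reward model
choice = torch.tensor([1])                      # labeler preferred candidate 1
loss = preference_loss(scores, choice)
```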

CLAIMS/QUOTES

  • SFT+RL is better: "On both datasets, supervised + RL fine-tuning is best, and indeed pure RL finetuning is worse than the supervised baseline according to ROUGE in all cases ..."

  • On miscommunication and incentives for human labelers: "This reveals a mismatch between the notion of quality we wanted our model to learn, and what the humans labelers actually evaluated"

  • Ambiguous tasks make labeling hard: Subjective judgments about what makes a summary good caused a lot of variance in the labeling: "When possible, it seems better to design less ambiguous labeling tasks that get at the same information."

  • Mixed results depending on the task: On the style transfer task the RLHF'd model performs very well, but on the text-summarization task the RLHF'd model learned to just select (copy) some sentences from the original text.

EXTENDS/USES

  • GPT-2 by Radford et al. 2019 summary
  • PPO (Proximal Policy Optimization) by Schulman et al. 2017 summary

NOTES

  • SFT beforehand: The authors mention that SFT was applied before the RL loop.

  • KL-divergence penalty: The KL-divergence penalty (now commonly framed as a way to prevent reward hacking) was apparently introduced here, although the authors say it was used to "encourage coherence and topicality".

  • KL-divergence hyperparameters: The authors weight the KL penalty with a \(\beta\) coefficient and vary it empirically for each problem to find the best value (the penalized reward is sketched after this list).

  • RL learned to copy: In the text summarization task, the RL-only models learned to just copy some sentences as a way of summarizing the texts (a form of reward hacking for text summarization).
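
For reference, the KL-penalized reward that PPO maximizes in the paper, with \(\pi\) the policy being fine-tuned, \(\rho\) the original (pretrained or supervised fine-tuned) language model, and \(r\) the learned reward model:

\[
R(x, y) = r(x, y) - \beta \, \log \frac{\pi(y \mid x)}{\rho(y \mid x)}
\]

A larger \(\beta\) keeps the fine-tuned policy closer to \(\rho\).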

