Paper Summary: Training language models to follow instructions with human feedback

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

Focuses on aligning the outputs of LLMs with perceived human preferences.

WHY

Because even with prompt hacking (e.g. appending "TL;DR:" to force summarization), LMs optimized for plain language modeling don't always generate the outputs needed in a generic AI-assistant setting.

RLHF

  • A 3-stage strategy, the last 2 steps of which make up an RLHF strategy (Reinforcement Learning from Human Feedback):

    • 0) Pretrained LM: Either train a large LM yourself or take a pretrained one such as GPT-3.
    • 1) Supervised Fine-tuning (SFT): Sample some prompts and give them to human annotators, who write a proper response to each prompt. Then fine-tune the pretrained LM from step 0 in a supervised manner on those prompt/answer pairs.
    • 2) Reward Model (RM): With the fine-tuned LM, we again sample some prompts, feed them to the model¹ and collect several outputs per prompt. We then ask human annotators to rank those outputs from best to worst according to how well they answer the original prompt.
      • The outcome is a model (RM) that takes a prompt/output pair and scores how aligned it is with what humans usually want.
      • The RM is itself a (Transformer-based) LM, with its output head replaced so it produces a single scalar score (see the loss sketch after this list).
    • 3) RL Fine-tuning: Initiate a Reinforcement Learning (RL) feedback loop whereby we:
      • Sample the LM for a prompt/output pair
      • Score the prompt/output pair with the Reward Model (a preference reward)
      • Score the same output against the SFT model from step 1, via a KL penalty, to see how close to "normal language" the output stays
      • PPO-ptx: Compute a final reward that combines the preference reward with the KL penalty (plus an extra pretraining-data likelihood term), so that the output is good in terms of alignment but also stays natural (see the reward sketch after this list)
      • Feed the final reward back to the LM and repeat the loop
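
A minimal PyTorch-style sketch of the pairwise ranking loss used to train the RM in step 2. The function and variable names (rm_pairwise_loss, reward_chosen, reward_rejected) are mine, not the paper's; the idea is simply that the RM should score the labeler-preferred output higher than the rejected one.

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for the reward model.

    reward_chosen:   RM scores r(x, y_w) for the outputs labelers preferred
    reward_rejected: RM scores r(x, y_l) for the outputs they ranked lower

    The loss -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over comparison
    pairs, pushes the RM to assign higher scores to preferred outputs.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: RM scores for a batch of 4 comparison pairs
chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
rejected = torch.tensor([0.4, 0.5, 1.1, -1.0])
print(rm_pairwise_loss(chosen, rejected))  # lower when preferred outputs score higher
```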
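
And a small sketch of how the per-sample reward in step 3 might be assembled from the RM score and a KL penalty against the SFT model. Names and the beta value are illustrative, and the pretraining-mix term that distinguishes PPO-ptx is left out here (see the objective under HOW).

```python
import torch

def rl_reward(rm_score: torch.Tensor,
              policy_logprobs: torch.Tensor,
              sft_logprobs: torch.Tensor,
              beta: float = 0.02) -> torch.Tensor:
    """Combine the preference reward with a KL penalty against the SFT model.

    rm_score:        scalar RM score for one prompt/output pair
    policy_logprobs: per-token log-probs of the sampled output under the RL policy
    sft_logprobs:    per-token log-probs of the same tokens under the SFT model
    beta:            KL penalty coefficient (illustrative value)
    """
    kl = (policy_logprobs - sft_logprobs).sum()  # drift away from the SFT model
    return rm_score - beta * kl

# Toy usage with random numbers standing in for real model outputs
rm_score = torch.tensor(1.3)
policy_lp = torch.randn(20)               # 20 generated tokens
sft_lp = policy_lp - 0.1 * torch.rand(20)
print(rl_reward(rm_score, policy_lp, sft_lp))
```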

HOW

The how is basically applying RLHF to a GPT-3 LM, with some technical optimizations.

PPO (Proximal Policy Optimization) is used to update the LM in the RL fine-tuning loop, with a modification that mixes in gradients from the original pretraining distribution so the model doesn't drift too far from the original, untuned LM (PPO-ptx, see RLHF above).
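
As far as I can tell, the full objective maximized in the RL fine-tuning stage (PPO-ptx) combines the ingredients above; β scales the KL penalty against the SFT policy and γ the pretraining-mix term:

$$
\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{RL}}}\!\left[ r_\theta(x,y) - \beta \log\frac{\pi_\phi^{RL}(y\mid x)}{\pi^{SFT}(y\mid x)} \right] + \gamma\, \mathbb{E}_{x\sim D_{\text{pretrain}}}\!\left[ \log \pi_\phi^{RL}(x) \right]
$$

Setting γ = 0 recovers plain PPO on the learned reward.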

CLAIMS

  • InstructGPT (1.3B params) produces outputs that labelers prefer over those of GPT-3 (175B params).

  • The cost of increasing model alignment is *modest relative to pretraining*

  • Learned alignment generalizes to held-out annotators

  • PPO-ptx can be used to avoid regressions, i.e. drops in performance on public NLP tasks that come from optimizing purely for human preferences (the alignment tax, see QUOTES)

QUOTES

  • Misalignment: "... the language modeling objective used for many recent large LMs — predicting the next token on a webpage from the internet — is different from the objective “follow the user’s instructions helpfully and safely”"

  • Alignment Tax: "... our alignment procedure comes at the cost of lower performance on certain tasks that we may care about."

    • This is reduced with PPO-ptx

NOTES

  • The 3 H's (helpful, honest and harmless) of implicit alignment were defined in Askell et al. 2021 (see refs)

  • Types of alignment

    • Explicit alignment: Following express orders such as "write a list such that..."
    • Implicit alignment: Not producing outright misleading text, not hallucinating.

MY 2¢

  • Beyond its technical contributions, the paper is a masterpiece of experiment design as well. Everything is set up to avoid bias and inaccuracies and to make efficient use of resources (human annotators, compute, etc.)

Footnotes

1: With an appropriate temperature setting, to generate diverse samples.


References
