Paper Summary: Training language models to follow instructions with human feedback



Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.



Introduce a strategy (InstructGPT) to fine-tune pre-trained LLMs to follow human instructions using Reinforcement Learning from Human Feedback (RLHF). [2]


Pretraining LLMs on unlabelled data does not make them good at following instructions or providing output that's aligned with the user's intent: We need something else.


  • It's a 3-stage strategy (assumes you already have a pre-trained, so-called vanilla LM):

    • 1) Supervised Fine-tuning (SFT): Collect a set of prompts (in the paper, prompts submitted to the OpenAI API plus some written by labelers) and have human annotators write a proper response to each one. Then fine-tune the pre-trained LM in a supervised manner on those prompt/answer pairs.
    • 2) Reward Model (RM): With the fine-tuned LM, we again take some prompts, feed them to the model [1] and get several outputs per prompt. We then ask human annotators to rank those outputs from best to worst, according to how well each one matches what the prompt asked for. The RM is trained on these comparisons.
      • The outcome is a model (RM) that takes a prompt/output pair and returns a scalar score for how aligned it is with what humans usually want.
      • The RM is also a language model (in the paper, a 6B GPT-3 model initialized from the SFT model, with the final unembedding layer removed so that it outputs a scalar).
    • 3) RL Fine-tuning: Initiate a Reinforcement Learning (RL) feedback loop whereby we:
      • Sample a prompt and let the LM generate an output
      • Score the prompt/output pair with the Reward Model (a Preference Reward)
      • Score the output with the SFT model (the LM before RL fine-tuning) to see how far the output has drifted from "normal language". In the paper this is a per-token KL penalty.
      • Calculate a Final Reward that combines the Preference Reward with the KL penalty, so that the output is good in terms of alignment and also stays natural (as defined by the SFT model)
      • Feed the Final Reward back to the LM (via PPO) and repeat the loop
      • PPO-ptx: a variant that also mixes pretraining gradients into the PPO updates, so the LM keeps its performance on the original language modeling task
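The two scoring ideas in the stages above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: the function names, the `beta` value, and the use of plain Python lists of scores and log-probabilities are all assumptions made for the example. The first function is the pairwise ranking loss used to train a reward model from human rankings. The second subtracts a KL-style penalty, measured against a reference model (the SFT model in the paper), from the reward model's score.

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(rewards_best_to_worst):
    """Average -log(sigmoid(r_better - r_worse)) over all ranked pairs.

    `rewards_best_to_worst` holds the reward model's scalar scores for the
    outputs of ONE prompt, ordered by the human ranking (best first).
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the output the human ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    total = 0.0
    for r_better, r_worse in pairs:
        # -log(sigmoid(d)) is the same as softplus(-d)
        total += softplus(-(r_better - r_worse))
    return total / len(pairs)


def shaped_reward(rm_score, logp_policy, logp_reference, beta=0.02):
    """Preference reward minus a KL-style penalty over the output tokens.

    `logp_policy` / `logp_reference` are per-token log-probabilities of the
    sampled output under the RL policy and under the reference model.
    `beta` is an illustrative value (hypothetical default for this sketch).
    """
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return rm_score - beta * kl_estimate
```

The loss is small when the reward model already scores the human-preferred outputs higher, and the shaped reward drops as the policy assigns its own samples a much higher probability than the reference model does.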


The how is basically applying RLHF to a GPT-3 LM, with some technical optimizations.

PPO (Proximal Policy Optimization) is used to update the LM in the RL Fine-tuning loop. The PPO-ptx variant mixes in gradients from the original pretraining objective, so that the LM does not drift too far from what it learned in pretraining.
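For reference, the combined objective can be written as below. This is my transcription of the RL objective from the paper, so treat the notation as a sketch: r_θ is the reward model, π_φ^RL the policy being trained, π^SFT the supervised fine-tuned model, β the KL coefficient, and γ the weight of the pretraining term. Setting γ = 0 gives plain PPO; γ > 0 gives PPO-ptx.

```latex
\mathrm{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
    \left[ r_\theta(x,y)
      - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
    \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

The first expectation is the preference reward with its KL penalty; the second is an ordinary language modeling log-likelihood on pretraining data.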


  • According to labelers, outputs from InstructGPT (1.3B params) are preferred over outputs from GPT-3 (175B params), despite InstructGPT having over 100x fewer parameters.

  • "The cost of increasing model alignment is modest relative to pretraining"

  • Learned alignment generalizes to held-out annotators (labelers who did not produce any of the training data)

  • PPO-ptx can be used to minimize performance regressions on public NLP datasets (i.e. tasks where the RL-tuned model would otherwise do worse than the original LM), without compromising labeler preference scores


  • Misalignment: "... the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective "follow the user’s instructions helpfully and safely""

  • Alignment Tax: "... our alignment procedure comes at the cost of lower performance on certain tasks that we may care about."

    • This is reduced with PPO-ptx


  • The 3 H's (helpful, honest, and harmless) of implicit alignment were defined in Askell et al., 2021. (see refs)

  • Types of alignment

    • Explicit alignment: Following express orders such as "write a list such that..."
    • Implicit alignment: Not producing outright misleading text, not hallucinating.

MY 2¢

  • In addition to every technological breakthrough in the paper, it's a masterpiece of experiment design as well. Everything is done to avoid bias and inaccuracies, and to make efficient use of resources (humans, computing, etc.)


1: With an appropriate temperature setting, to generate diverse samples.

2: It is widely believed that ChatGPT was trained using RLHF as described in this article.

