Paper Summary: Training language models to follow instructions with human feedback

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

Focuses on aligning the outputs of LLMs with perceived human preferences.

WHY

Because even with prompt hacking (e.g. appending "TL;DR:" to force summarization), LMs optimized for plain language modeling don't always generate the outputs needed in a generic AI-assistant setting.

RLHF

  • A 3-stage strategy, the last 2 steps of which make up an RLHF strategy (Reinforcement Learning from Human Feedback):

    • 0) Pretrained LM: Either train a large LM yourself or take a pretrained one such as GPT-3.
    • 1) Supervised Fine-tuning (SFT): Sample some prompts and give them to human annotators, who write a proper response to each prompt. Then fine-tune the pretrained LM from step 0 in a supervised manner on those prompt/answer pairs.
    • 2) Reward Model (RM): With the fine-tuned LM, we again sample some prompts, feed them to the model¹ and collect several outputs per prompt. We then ask human annotators to rank those outputs from best to worst according to how well they answer the original prompt.
      • The outcome is a model (RM) that takes a prompt/output pair and scores how aligned it is with what humans usually want.
      • The RM is itself a (Transformer-based) LM, with its output head replaced so it produces a single scalar score (see the loss sketch after this list).
    • 3) RL Fine-tuning: Initiate a Reinforcement Learning (RL) feedback loop whereby we:
      • Sample the LM for a prompt/output pair
      • Score the prompt/output pair with the Reward Model (a preference reward)
      • Score the same output against the SFT model from step 1, via a KL penalty, to see how close to "normal language" the output stays
      • PPO-ptx: Compute a final reward that combines the preference reward with the KL penalty (plus an extra pretraining-data likelihood term), so that the output is good in terms of alignment but also stays natural (see the reward sketch after this list)
      • Feed the final reward back to the LM and repeat the loop
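
A minimal PyTorch-style sketch of the pairwise ranking loss used to train the RM in step 2. The function and variable names (rm_pairwise_loss, reward_chosen, reward_rejected) are mine, not the paper's; the idea is simply that the RM should score the labeler-preferred output higher than the rejected one.

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for the reward model.

    reward_chosen:   RM scores r(x, y_w) for the outputs labelers preferred
    reward_rejected: RM scores r(x, y_l) for the outputs they ranked lower

    The loss -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over comparison
    pairs, pushes the RM to assign higher scores to preferred outputs.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: RM scores for a batch of 4 comparison pairs
chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
rejected = torch.tensor([0.4, 0.5, 1.1, -1.0])
print(rm_pairwise_loss(chosen, rejected))  # lower when preferred outputs score higher
```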
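
And a small sketch of how the per-sample reward in step 3 might be assembled from the RM score and a KL penalty against the SFT model. Names and the beta value are illustrative, and the pretraining-mix term that distinguishes PPO-ptx is left out here (see the objective under HOW).

```python
import torch

def rl_reward(rm_score: torch.Tensor,
              policy_logprobs: torch.Tensor,
              sft_logprobs: torch.Tensor,
              beta: float = 0.02) -> torch.Tensor:
    """Combine the preference reward with a KL penalty against the SFT model.

    rm_score:        scalar RM score for one prompt/output pair
    policy_logprobs: per-token log-probs of the sampled output under the RL policy
    sft_logprobs:    per-token log-probs of the same tokens under the SFT model
    beta:            KL penalty coefficient (illustrative value)
    """
    kl = (policy_logprobs - sft_logprobs).sum()  # drift away from the SFT model
    return rm_score - beta * kl

# Toy usage with random numbers standing in for real model outputs
rm_score = torch.tensor(1.3)
policy_lp = torch.randn(20)               # 20 generated tokens
sft_lp = policy_lp - 0.1 * torch.rand(20)
print(rl_reward(rm_score, policy_lp, sft_lp))
```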

HOW

The how is basically applying RLHF to a GPT-3 LM, with some technical optimizations.

PPO (Proximal Policy Optimization) is used to update the LM in the RL fine-tuning loop, with a modification that mixes in gradients from the original pretraining distribution so the model doesn't drift too far from the original, untuned LM (PPO-ptx, see RLHF above).
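
As far as I can tell, the full objective maximized in the RL fine-tuning stage (PPO-ptx) combines the ingredients above; β scales the KL penalty against the SFT policy and γ the pretraining-mix term:

$$
\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{RL}}}\!\left[ r_\theta(x,y) - \beta \log\frac{\pi_\phi^{RL}(y\mid x)}{\pi^{SFT}(y\mid x)} \right] + \gamma\, \mathbb{E}_{x\sim D_{\text{pretrain}}}\!\left[ \log \pi_\phi^{RL}(x) \right]
$$

Setting γ = 0 recovers plain PPO on the learned reward.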

CLAIMS

  • InstructGPT (1.3B params) produces outputs that labelers prefer over those of GPT-3 (175B params).

  • The cost of increasing model alignment is *modest relative to pretraining*

  • Learned alignment generalizes to held-out annotators

  • PPO-ptx can be used to avoid regressions, i.e. drops in performance on public NLP tasks that come from optimizing purely for human preferences (the alignment tax, see QUOTES)

QUOTES

  • Misalignment: "... the language modeling objective used for many recent large LMs — predicting the next token on a webpage from the internet — is different from the objective “follow the user’s instructions helpfully and safely”"

  • Alignment Tax: "... our alignment procedure comes at the cost of lower performance on certain tasks that we may care about."

    • This is reduced with PPO-ptx

NOTES

  • The 3 H's (helpful, honest and harmless) of implicit alignment were defined in Askell et al. 2021 (see refs)

  • Types of alignment

    • Explicit alignment: Following express orders such as "write a list such that..."
    • Implicit alignment: Not producing outright misleading text, not hallucinating.

MY 2¢

  • Beyond its technical contributions, the paper is a masterpiece of experiment design as well. Everything is set up to avoid bias and inaccuracies and to make efficient use of resources (human annotators, compute, etc.)

Footnotes

1: With an appropriate temperature setting, to generate diverse samples.


References
