Paper Summary: Llama 2: Open Foundation and Fine-Tuned Chat Models
Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
Updated version of LLaMA 1 (summary), trained on more data (still only publicly available sources), with double the context length and grouped-query attention (GQA) in the larger model sizes.
Two model variants are published: a vanilla LLM (Llama 2) and an instruction-tuned chat version (Llama 2-Chat).
HOW
LLaMA-2: Similar to LLaMA-1, but trained for one epoch on 40% more data (only publicly available sources), with better data cleaning, a larger context window, and grouped-query attention (GQA) in the larger model sizes (a minimal sketch follows at the end of this section).
LLaMA-2-chat: SFT and RLHF instruction-tuning on top of LLaMA-2.
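The idea behind grouped-query attention is that several query heads share one key/value head, which shrinks the KV cache at inference time compared to full multi-head attention. Below is a minimal sketch, assuming PyTorch; the head counts and shapes are illustrative rather than the Llama 2 configuration, and the causal mask is omitted.

```python
import torch
import torch.nn.functional as F


def grouped_query_attention(q, k, v):
    """q: (batch, seq, n_q_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim).

    Each group of n_q_heads // n_kv_heads query heads shares one K/V head,
    which shrinks the KV cache compared to full multi-head attention.
    Causal masking is omitted for brevity.
    """
    group_size = q.shape[2] // k.shape[2]
    # Repeat each K/V head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=2)  # -> (batch, seq, n_q_heads, head_dim)
    v = v.repeat_interleave(group_size, dim=2)

    # Standard scaled dot-product attention over the expanded heads.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)
    return (weights @ v).transpose(1, 2)  # (batch, seq, n_q_heads, head_dim)


# Toy usage: 8 query heads sharing 2 K/V heads.
q = torch.randn(1, 16, 8, 64)
k = torch.randn(1, 16, 2, 64)
v = torch.randn(1, 16, 2, 64)
out = grouped_query_attention(q, k, v)  # (1, 16, 8, 64)
```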
CLAIMS
Using a smaller but higher-quality fine-tuning (SFT) dataset yields better results than a much larger, lower-quality one.
RLHF is responsible for most of the increase in instruction-following performance.
QUOTES
Small but high-quality instruction-following data for SFT: "We found that SFT annotations in the order of tens of thousands was (sic) enough to achieve a high-quality result. We stopped annotating SFT after collecting a total of 27,540 annotations"
Reward model initialization: "We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model knows."
EXTENDS/USES
Main architectural decisions from LLaMA-1 (Touvron et al., 2023).
Grouped-query Attention (GQA), from Ainslie et al., 2023.
RLHF loop from InstructGPT (Ouyang et al., 2022).
- But they also experiment with Rejection Sampling fine-tuning in addition to PPO (see the sketch below).
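Rejection Sampling fine-tuning is simple to state: sample several answers per prompt from the current model, keep the one the reward model scores highest, and fine-tune on the winners with the ordinary SFT loss. A minimal sketch of one round; the callables are hypothetical stand-ins for the policy, the reward model, and the SFT step:

```python
def rejection_sampling_round(prompts, sample_fn, reward_fn, finetune_fn, k=4):
    """One round of rejection-sampling fine-tuning (hypothetical helper callables).

    sample_fn(prompt) -> str           : draw one answer from the current policy
    reward_fn(prompt, answer) -> float : scalar score from the reward model
    finetune_fn(pairs)                 : run standard SFT on (prompt, best_answer) pairs
    """
    selected = []
    for prompt in prompts:
        # Sample k candidate answers and keep the one the reward model prefers.
        candidates = [sample_fn(prompt) for _ in range(k)]
        best = max(candidates, key=lambda ans: reward_fn(prompt, ans))
        selected.append((prompt, best))
    # Fine-tune on the winners with the usual next-token loss (the paper then
    # continues with PPO on top of this).
    finetune_fn(selected)
    return selected


# Toy usage with dummy stand-ins; a real setup would plug in the LLM and the RM.
pairs = rejection_sampling_round(
    prompts=["Explain RLHF in one sentence."],
    sample_fn=lambda p: f"draft answer to: {p}",
    reward_fn=lambda p, a: float(len(a)),  # dummy "reward"
    finetune_fn=lambda batch: None,        # no-op in this sketch
)
```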
NOTES
Just like the DPO paper (summary), the authors used GPT-4 to evaluate the models subjectively.
Authors tried to decrease hallucination by oversampling known trusted sources.
Two reward models were trained: one optimized only for helpfulness, the other only for safety.
The reward model is also a transformer-based LM, but its next-token prediction head is replaced by a scalar regression head that outputs a reward score (sketch below).
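To make that concrete: the paper trains the reward model on human preference pairs with a binary ranking loss, plus a margin term that is larger when annotators rate the preferred answer as clearly better. A minimal sketch, assuming PyTorch; variable names are mine:

```python
import torch
import torch.nn.functional as F


def reward_ranking_loss(score_chosen, score_rejected, margin):
    """Pairwise ranking loss: -log(sigmoid(r(x, y_chosen) - r(x, y_rejected) - m)).

    score_chosen / score_rejected: scalar reward-head outputs for the preferred
    and rejected responses; margin grows when the preference is rated as clearer.
    """
    return -F.logsigmoid(score_chosen - score_rejected - margin).mean()


# Toy usage with a batch of three preference comparisons.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, 1.0])
margin = torch.tensor([1.0, 0.0, 0.5])
loss = reward_ranking_loss(chosen, rejected, margin)
```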
Authors introduce a fine-tuning technique called Ghost Attention (GAtt). Despite the name, it is not a new attention mechanism; it is a data-construction trick that helps the model keep following an initial (system) instruction consistently across the turns of a multi-turn conversation (rough sketch below).
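A rough sketch of the GAtt data construction as I understand it from the paper: the instruction is attached to every user turn when sampling replies, then dropped from all but the first turn for training, with the loss zeroed on everything except the final assistant reply. The helper `sample_reply_fn` is a hypothetical stand-in for the RLHF model's generation step:

```python
def build_gatt_example(instruction, user_turns, sample_reply_fn):
    """Construct one Ghost-Attention-style training example.

    sample_reply_fn(dialogue) -> str is a hypothetical stand-in for sampling an
    assistant reply from the current RLHF model given the dialogue so far.
    """
    # 1) Sampling: attach the instruction to every user turn so the sampled
    #    replies actually respect it throughout the dialogue.
    dialogue = []
    for user_msg in user_turns:
        dialogue.append(("user", f"{instruction}\n{user_msg}"))
        dialogue.append(("assistant", sample_reply_fn(dialogue)))

    # 2) Training: keep the instruction only in the first user turn, and compute
    #    the loss only on the final assistant reply (zeroed on earlier turns).
    example = []
    for i, (role, text) in enumerate(dialogue):
        if role == "user" and i > 0:
            text = text.replace(f"{instruction}\n", "", 1)
        example.append((role, text, i == len(dialogue) - 1))  # (role, text, use_in_loss)
    return example


# Toy usage with a dummy reply sampler.
example = build_gatt_example(
    instruction="Always answer in French.",
    user_turns=["Hi!", "What is the capital of Spain?"],
    sample_reply_fn=lambda dialogue: "...",
)
```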
Authors used red-team adversarial attacks on the model to test its safety.
MY 2¢
- Training loss (i.e., log-PPL) shows no sign of saturation even after 2T tokens of pretraining (Figure 5 in the paper).