Paper Summary: Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Last updated:
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
The authors run several experiments, including ablation studies, to compare the circumstances under which PPO is superior or inferior to DPO for RLHF.
WHY
To investigate why some papers report that DPO beats PPO, even though most production models still use PPO.
HOW
They first construct synthetic scenarios with known ground-truth data and compare how each method (DPO vs. PPO) optimizes toward the solution and what mistakes each makes.
They then move to real-world open-source datasets: SFT/instruction data such as Alpaca, preference datasets such as HH-RLHF and SafeRLHF, and coding-specific SFT datasets.
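A toy illustration of the synthetic-preference idea (my own sketch, not the paper's actual construction): sample candidate completions, score them with a known ground-truth reward, and keep (chosen, rejected) pairs so that later DPO/PPO results can be judged against that reward. Both `ground_truth_reward` and the sampler below are hypothetical stand-ins.

```python
import random

def ground_truth_reward(prompt: str, completion: str) -> float:
    # Toy stand-in for a known reward function: favors longer completions
    # that reuse words from the prompt (purely illustrative assumption).
    overlap = len(set(prompt.split()) & set(completion.split()))
    return 0.1 * len(completion.split()) + overlap

def make_synthetic_preferences(prompts, sample_completion, n_candidates=4):
    """Build (prompt, chosen, rejected) triples by ranking sampled completions
    with the known reward, so DPO/PPO training can later be checked against it."""
    pairs = []
    for prompt in prompts:
        candidates = [sample_completion(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda c: ground_truth_reward(prompt, c))
        pairs.append((prompt, ranked[-1], ranked[0]))  # best vs. worst candidate
    return pairs

# Usage with a dummy sampler that shuffles and truncates the prompt's words.
sampler = lambda p: " ".join(random.sample(p.split(), k=random.randint(1, len(p.split()))))
print(make_synthetic_preferences(["explain the kl divergence penalty in ppo"], sampler))
```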
CLAIMS
PPO is consistently better than DPO on code-generation tasks.
Differences in text distribution between the pretraining, SFT, and preference datasets hurt DPO's performance.
DPO is also prone to the same objective-function "hacking" seen in PPO when PPO is not regularized by the KL-divergence term (a sketch of the DPO loss follows the quote below).
- "...any solution found by PPO also minimizes the DPO objective and thus, any solution found by PPO that exploits the reward model can also be found by DPO".
These factors help PPO performance (a code sketch follows this list):
- Using advantage normalization;
- Using a large batch size during training, which consistently improves optimization;
- Using an exponential moving average to update the reference (SFT) model during RL optimization.
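Two of these tricks are easy to show in code (a hedged sketch; the decay value and where these calls sit in the PPO loop are my assumptions, not the paper's reported settings). The large batch size is a training-configuration choice rather than code.

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten advantages within the batch before computing the PPO policy loss."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

@torch.no_grad()
def ema_update_reference(ref_model: torch.nn.Module,
                         policy_model: torch.nn.Module,
                         decay: float = 0.992) -> None:
    """Drift the reference (SFT) model toward the current policy with an
    exponential moving average; `decay` here is an assumed hyperparameter."""
    for ref_p, pol_p in zip(ref_model.parameters(), policy_model.parameters()):
        ref_p.mul_(decay).add_(pol_p, alpha=1.0 - decay)

# Schematic usage inside a PPO step:
#   advantages = normalize_advantages(advantages)
# and, every K policy updates:
#   ema_update_reference(ref_model, policy_model)
```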
QUOTES
- Reward-based vs. reward-free RLHF: "Existing RLHF methods can be roughly categorized as either reward-based or reward-free"
- InstructGPT and others are reward-based (reward model + PPO), while DPO is reward-free
EXTENDS/USES
- Llama2 (Touvron et al., 2023) base model
- Alpaca (Taori et al., 2023) SFT/instruction dataset
- SafeRLHF (Dai et al., 2023) preference dataset
- DeepSpeed-Chat (Yao et al., 2023) PPO implementation
NOTES
The authors mention other reward-free methods: RRHF and PRO.
The preference datasets have labels both for Helpfulness and Harmlessness (Safety).
- Interestingly, the third H (honesty) is not part of the optimization; it is instead verified through unit tests (!)