Paper Summary: Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Last updated:
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
The authors run several experiments, including ablation studies, to compare the circumstances under which PPO is superior or inferior to DPO for RLHF.
WHY
To investigate why some papers report that DPO beats PPO, even though most production models still use PPO.
HOW
They first construct synthetic scenarios with known ground-truth data and compare how each method (DPO vs. PPO) optimizes toward the solution and what mistakes each makes.
They then move to real-world open-source datasets: SFT/instruction data such as Alpaca, preference datasets such as HH-RLHF and SafeRLHF, and coding-specific SFT datasets.
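A toy illustration of the synthetic-preference idea (my own sketch, not the paper's actual construction): sample candidate completions, score them with a known ground-truth reward, and keep (chosen, rejected) pairs so that later DPO/PPO results can be judged against that reward. Both `ground_truth_reward` and the sampler below are hypothetical stand-ins.

```python
import random

def ground_truth_reward(prompt: str, completion: str) -> float:
    # Toy stand-in for a known reward function: favors longer completions
    # that reuse words from the prompt (purely illustrative assumption).
    overlap = len(set(prompt.split()) & set(completion.split()))
    return 0.1 * len(completion.split()) + overlap

def make_synthetic_preferences(prompts, sample_completion, n_candidates=4):
    """Build (prompt, chosen, rejected) triples by ranking sampled completions
    with the known reward, so DPO/PPO training can later be checked against it."""
    pairs = []
    for prompt in prompts:
        candidates = [sample_completion(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda c: ground_truth_reward(prompt, c))
        pairs.append((prompt, ranked[-1], ranked[0]))  # best vs. worst candidate
    return pairs

# Usage with a dummy sampler that shuffles and truncates the prompt's words.
sampler = lambda p: " ".join(random.sample(p.split(), k=random.randint(1, len(p.split()))))
print(make_synthetic_preferences(["explain the kl divergence penalty in ppo"], sampler))
```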
CLAIMS
PPO is consistently better than DPO on code-generation tasks.
Differences in text distribution between the pretraining, SFT, and preference datasets hurt DPO's performance.
DPO is also prone to the same objective-function "hacking" seen in PPO when PPO is not regularized by the KL-divergence term (a sketch of the DPO loss follows the quote below).
- "...any solution found by PPO also minimizes the DPO objective and thus, any solution found by PPO that exploits the reward model can also be found by DPO".
These factors help PPO performance (a code sketch follows this list):
- Using advantage normalization;
- Using a large batch size during training, which consistently improves optimization;
- Using an exponential moving average to update the reference (SFT) model during RL optimization.
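Two of these tricks are easy to show in code (a hedged sketch; the decay value and where these calls sit in the PPO loop are my assumptions, not the paper's reported settings). The large batch size is a training-configuration choice rather than code.

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten advantages within the batch before computing the PPO policy loss."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

@torch.no_grad()
def ema_update_reference(ref_model: torch.nn.Module,
                         policy_model: torch.nn.Module,
                         decay: float = 0.992) -> None:
    """Drift the reference (SFT) model toward the current policy with an
    exponential moving average; `decay` here is an assumed hyperparameter."""
    for ref_p, pol_p in zip(ref_model.parameters(), policy_model.parameters()):
        ref_p.mul_(decay).add_(pol_p, alpha=1.0 - decay)

# Schematic usage inside a PPO step:
#   advantages = normalize_advantages(advantages)
# and, every K policy updates:
#   ema_update_reference(ref_model, policy_model)
```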
QUOTES
- Reward-based vs. reward-free RLHF: "Existing RLHF methods can be roughly categorized as either reward-based or reward-free"
- InstructGPT and others are reward-based (reward model + PPO), while DPO is reward-free
EXTENDS/USES
- Llama2 (Touvron et al., 2023) base model
- Alpaca (Taori et al., 2023) SFT/instruction dataset
- SafeRLHF (Dai et al., 2023) preference dataset
- DeepSpeed-Chat (Yao et al., 2023) PPO implementation
NOTES
The authors mention other reward-free methods: RRHF and PRO.
The preference datasets have labels both for Helpfulness and Harmlessness (Safety).
- Interestingly, the third H (honesty) is not part of the optimization; it is instead verified through unit tests (!)