Paper Summary: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
An approach to align pre-trained LMs to human preferences without using Reinforcement Learning (RL).
WHY
Because RL-based instruction-tuning methods (such as RLHF) are costly and difficult to implement.
HOW
The authors show that the RLHF objective can be rewritten as a loss function over the policy itself, which can be optimized directly with standard gradient-based methods (e.g., SGD), with no explicit reward model and no RL loop.
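Concretely (writing the loss down from memory; π_ref is the frozen SFT/reference model, β controls how strongly the policy is kept close to it, and y_w / y_l are the preferred / dispreferred completions):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```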
Fine-tuning requires a preference dataset: for each prompt, a preferred (so-called chosen) completion and a dispreferred (so-called rejected) completion. The loss is computed over both completions of each pair.
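A minimal PyTorch sketch of that loss, assuming we already have the summed log-probabilities of each chosen/rejected completion under the trainable policy and under the frozen reference model (the function name and the beta default are mine, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument: tensor of shape (batch,), the sum of token log-probs of a
    completion given its prompt. beta scales the implicit KL regularization."""
    # Implicit reward of a completion = beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic (Bradley-Terry style) loss: push the chosen reward above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that only log-probabilities are needed, so this trains like an ordinary supervised objective: no sampling from the policy during training, no reward model, no PPO machinery.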
CLAIMS
Objective evaluation: a better reward/KL trade-off than PPO (the RL algorithm used by RLHF), i.e., higher reward at a comparable KL-divergence from the reference (SFT) model's distribution (see the KL sketch after this list).
Subjective evaluation: also better results than RLHF-PPO, but the comparison setup is very unconventional and relies on proxies: the authors use GPT-4 judgments as ground truth, a sentiment classifier to score generated text with respect to sentiment, etc.
Learning with DPO is more stable (smaller variance) than RLHF-PPO.
DPO converges quickly.
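Regarding the reward/KL frontier in the objective evaluation above, here is a rough sketch (my own, not necessarily how the authors computed it) of a Monte Carlo estimate of the sequence-level KL(π_θ ‖ π_ref), given per-token log-probs of completions sampled from the policy:

```python
import torch

def sequence_kl_estimate(policy_logps, ref_logps, completion_mask):
    """policy_logps, ref_logps: (batch, seq_len) per-token log-probs of completions
    sampled from the policy; completion_mask is 1 on completion tokens, 0 elsewhere.
    Uses KL(pi || ref) = E_{y ~ pi}[ log pi(y|x) - log ref(y|x) ]."""
    per_sequence = ((policy_logps - ref_logps) * completion_mask).sum(dim=-1)
    return per_sequence.mean()  # average over sampled completions
```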
NOTES
GPT-4 (zero-shot) was used to evaluate DPO against other types of fine-tuning. Crazy.
DPO was applied to an LM that had previously been fine-tuned with regular supervised fine-tuning (SFT).
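A hypothetical end-to-end sketch of that pipeline (the names, hyperparameters, and the HuggingFace-style `model(ids).logits` interface are my assumptions, not the paper's code): start from the SFT checkpoint, freeze a copy as the reference model, and minimize the DPO loss on preference pairs.

```python
import copy
import torch
import torch.nn.functional as F

def completion_logps(model, input_ids, completion_mask):
    """Summed log-probs of the completion tokens in a (prompt + completion) sequence.
    completion_mask is 1 on completion tokens, 0 on prompt/padding tokens."""
    logits = model(input_ids).logits[:, :-1, :]          # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_logps = torch.gather(F.log_softmax(logits, dim=-1), 2,
                               targets.unsqueeze(-1)).squeeze(-1)
    return (token_logps * completion_mask[:, 1:]).sum(dim=-1)

def train_dpo(sft_model, preference_loader, beta=0.1, lr=1e-6, epochs=1):
    policy = sft_model                                   # trainable policy, initialized from SFT
    ref = copy.deepcopy(sft_model).eval()                # frozen reference model
    for p in ref.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in preference_loader:                  # chosen/rejected ids + masks per prompt
            pol_w = completion_logps(policy, batch["chosen_ids"], batch["chosen_mask"])
            pol_l = completion_logps(policy, batch["rejected_ids"], batch["rejected_mask"])
            with torch.no_grad():
                ref_w = completion_logps(ref, batch["chosen_ids"], batch["chosen_mask"])
                ref_l = completion_logps(ref, batch["rejected_ids"], batch["rejected_mask"])
            # same DPO loss as in the sketch above, inlined here
            loss = -F.logsigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l))).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```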