Paper Summary: A General Theoretical Paradigm to Understand Learning from Human Preferences
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
The authors propose a general framework called Ψ-PO (Ψ-preference optimization): an objective expressed directly in terms of pairwise preference probabilities, transformed by a general function Ψ, that can be optimized directly on preference data (similarly to DPO). RLHF and DPO are special cases of this generic formulation.
They also introduce one specific instantiation of the framework, called IPO or Identity Preference Optimization, which has some benefits over DPO.
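As I understand it (my notation, roughly following the paper), the Ψ-PO objective maximizes a Ψ-transformed preference probability against a behavior policy \(\mu\), with KL regularization towards the reference policy:

\[
\max_{\pi}\; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}\Big[\Psi\big(p^{*}(y \succ y' \mid x)\big)\Big] \;-\; \tau\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
\]

If I recall correctly, choosing \(\Psi(q) = \log\frac{q}{1-q}\) together with a Bradley-Terry model recovers the RLHF/DPO objective, while the identity \(\Psi(q) = q\) gives IPO.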
WHY
Although DPO works well in practice, the authors argue it lacks a thorough theoretical grounding.
HOW
The authors derive the Ψ-PO objective as a generalization of both RLHF and DPO.
They then take the identity function as Ψ and arrive at IPO (Identity Preference Optimization); a minimal sketch of the resulting loss is below.
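For my own reference, here is a minimal sketch of the IPO loss in PyTorch. This is my reading of the paper rather than the authors' code; the function name, variable names, and the default `tau` are my own.

```python
import torch

def ipo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, tau=0.1):
    """Sketch of the IPO pairwise loss (my reading of the paper, not official code).

    Inputs are summed log-probabilities of whole completions under the policy
    being trained and the frozen reference model; `tau` controls the strength
    of the implicit KL regularization (default value is my own choice).
    """
    # h = log[ pi(y_w)/pi(y_l) ] - log[ pi_ref(y_w)/pi_ref(y_l) ]
    h = (policy_logp_chosen - policy_logp_rejected) \
        - (ref_logp_chosen - ref_logp_rejected)
    # IPO regresses h towards the finite target 1/(2*tau) with a squared loss,
    # rather than pushing it towards infinity as DPO's sigmoid loss can.
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
```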
CLAIMS/QUOTES
RLHF and DPO are prone to overfitting because they assume Bradley-Terry equivalence: "This is due to the fact that [RLHF and DPO] rely on the strong assumption that pairwise preferences can be substituted with Elo-score (pointwise rewards) via a Bradley-Terry (BT) modelisation (Bradley and Terry, 1952)."
Why DPO Overfits: "DPO is prone to overfitting, and this stems from a combination of the unboundedness of Ψ, together with not training an explicit reward function"
KL-divergence regularization weakens as preferences become more deterministic: "... the strength of the KL-regularisation becomes weaker and weaker the more deterministic the preferences"
IPO regularizes where DPO doesn't: "In other words IPO, unlike DPO, always regularizes its solution towards \(\pi_{\mathrm{ref}}\)"
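To make these quotes concrete for myself (my notation; define \(h_\pi = \log\frac{\pi(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)}\)), the per-pair losses look roughly like

\[
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\big(\beta\, h_\pi\big),
\qquad
\mathcal{L}_{\mathrm{IPO}} = \Big(h_\pi - \tfrac{1}{2\tau}\Big)^{2}.
\]

With (near-)deterministic preferences the DPO loss keeps decreasing as \(h_\pi \to \infty\), i.e. as \(\pi\) drifts arbitrarily far from \(\pi_{\mathrm{ref}}\), whereas the IPO loss pins \(h_\pi\) at the finite target \(1/(2\tau)\), which is why it always regularizes towards \(\pi_{\mathrm{ref}}\).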
EXTENDS/USES
- DPO
- RLHF
NOTES
- Pairwise preference dataset still needed: IPO still needs a preference dataset with pairwise or ranked preferences, just like RLHF and DPO (a hypothetical example record is sketched below).
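For reference, a pairwise preference record typically looks something like this (hypothetical example, not taken from any dataset used in the paper):

```python
# One pairwise preference example: the same prompt paired with a preferred
# ("chosen") and a dispreferred ("rejected") completion.
preference_example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model fits noise in the training data "
              "and fails to generalize.",
    "rejected": "Overfitting is when training takes too long.",
}
```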
MY 2¢
- Very nice theoretical finding, but the article could have included an empirical comparison between DPO, RLHF, and IPO in some controlled environment. (See references.)
REFERENCES
- HuggingFace: Comparing DPO, IPO and KTO on 7B models
  - Concludes that DPO is overall better, with IPO close behind and KTO further back
  - Involves a lot of manual hyperparameter tuning, though, so it is hard to say whether the comparison is fair.