Paper Summary: A General Theoretical Paradigm to Understand Learning from Human Preferences

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Paper cover: A General Theoretical Paradigm to Understand Learning from Human Preferences (Source)

WHAT

Authors propose a generalized framework called Ψ-PO, defined through a generic function Ψ of preference probabilities, that optimizes preference data directly (similarly to DPO). DPO and RLHF are special cases of this generic formulation.

They also introduce one specific instantiation of the framework, IPO (Identity Preference Optimization), which has some benefits over DPO.
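
As I read it, the Ψ-PO objective has roughly the following form (my transcription, so the exact sampling distributions are my understanding of the paper): \(p^*(y \succ y' \mid x)\) is the true preference probability, \(\mu\) is the policy generating the comparison completions, and \(\tau\) controls the KL-regularization. Taking \(\Psi(q) = \log\frac{q}{1-q}\) under a Bradley-Terry model recovers the RLHF/DPO objective, while \(\Psi(q) = q\) gives IPO:

\[
\max_{\pi} \; \mathbb{E}_{x \sim \rho}\, \mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)} \big[ \Psi\big(p^*(y \succ y' \mid x)\big) \big] \;-\; \tau\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
\]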

WHY

Although DPO works in practice, authors argue that it lacks a thorough theoretical grounding.

HOW

  • Authors derive the Ψ-PO objective as a generalization of both RLHF and DPO

  • They then instantiate Ψ with the identity function and arrive at IPO (Identity Preference Optimization); the resulting loss is sketched below
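
Concretely, my reading of the paper's sampled IPO loss over preference pairs \((x, y_w, y_l)\) is a squared distance between a policy/reference log-ratio difference and the finite target \(\frac{1}{2\tau}\) (notation may differ slightly from the paper's):

\[
h_\pi(y_w, y_l, x) = \log \frac{\pi(y_w \mid x)\, \pi_{\mathrm{ref}}(y_l \mid x)}{\pi(y_l \mid x)\, \pi_{\mathrm{ref}}(y_w \mid x)},
\qquad
\mathcal{L}_{\mathrm{IPO}}(\pi) = \mathbb{E}\Big[ \big( h_\pi(y_w, y_l, x) - \tfrac{1}{2\tau} \big)^2 \Big]
\]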

CLAIMS/QUOTES

  • Why RLHF and DPO can overfit: they assume pairwise preferences can be replaced by pointwise rewards under a Bradley-Terry model: "This is due to the fact that [RLHF and DPO] rely on the strong assumption that pairwise preferences can be substituted with Elo-score (pointwise rewards) via a Bradley-Terry (BT) modelisation (Bradley and Terry, 1952)."

  • Why DPO Overfits: "DPO is prone to overfitting, and this stems from a combination of the unboundedness of Ψ, together with not training an explicit reward function"

  • KL-regularization weakens as preferences become more deterministic: "... the strength of the KL-regularisation becomes weaker and weaker the more deterministic the preferences"

  • IPO regularizes where DPO doesn't: "In other words IPO, unlike DPO, always regularizes its solution towards \(\pi_{\mathrm{ref}}\)" (a small code sketch after this list contrasts the two losses)
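
To make the regularization contrast concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper; the function name and framework choice are mine). Both losses depend on the same log-ratio difference \(h\); DPO's loss keeps decreasing as \(h \to \infty\) on (near-)deterministic preferences, while IPO pulls \(h\) toward the finite target \(1/(2\tau)\):

```python
import torch
import torch.nn.functional as F

def preference_losses(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """DPO vs. IPO losses on a batch of (chosen, rejected) preference pairs.

    logp_w, logp_l         -- log pi(y_w|x), log pi(y_l|x) under the trained policy
    ref_logp_w, ref_logp_l -- same quantities under the frozen reference policy
    tau                    -- regularization strength (the DPO paper calls it beta)
    """
    # Shared quantity: difference of policy/reference log-ratios.
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)

    # DPO: -log sigmoid(tau * h); the loss keeps shrinking as h grows,
    # so deterministic preferences can push pi arbitrarily far from pi_ref.
    dpo_loss = -F.logsigmoid(tau * h).mean()

    # IPO: squared distance of h to the finite target 1/(2*tau),
    # so the optimum stays a bounded distance from pi_ref.
    ipo_loss = ((h - 1.0 / (2.0 * tau)) ** 2).mean()

    return dpo_loss, ipo_loss

# Toy usage: placeholder log-probabilities for a batch of 4 preference pairs.
logp = -torch.rand(4, 4)
print(preference_losses(logp[0], logp[1], logp[2], logp[3]))
```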

EXTENDS/USES

  • DPO
  • RLHF

NOTES

  • Pairwise preference dataset still needed: IPO still needs a preference dataset with pairwise or ranked preferences, just like RLHF and DPO.

MY 2¢

  • Very nice theoretical finding, but the article could have included an empirical comparison between DPO, RLHF, and IPO in a controlled environment. (See references)

REFERENCES
