Paper Summary: A General Theoretical Paradigm to Understand Learning from Human Preferences

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Paper cover: A General Theoretical Paradigm to Understand Learning from Human Preferences (Source)

WHAT

Authors propose a generalized framework called Ψ-PO, defined through a generic function Ψ of preference probabilities, that optimizes preference data directly (similarly to DPO). DPO and RLHF are special cases of this generic formulation.

They also introduce one specific instantiation of the framework, IPO (Identity Preference Optimization), which has some benefits over DPO.
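
As I read it, the Ψ-PO objective has roughly the following form (my transcription, so the exact sampling distributions are my understanding of the paper): \(p^*(y \succ y' \mid x)\) is the true preference probability, \(\mu\) is the policy generating the comparison completions, and \(\tau\) controls the KL-regularization. Taking \(\Psi(q) = \log\frac{q}{1-q}\) under a Bradley-Terry model recovers the RLHF/DPO objective, while \(\Psi(q) = q\) gives IPO:

\[
\max_{\pi} \; \mathbb{E}_{x \sim \rho}\, \mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)} \big[ \Psi\big(p^*(y \succ y' \mid x)\big) \big] \;-\; \tau\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
\]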

WHY

Although DPO works in practice, authors argue that it lacks a thorough theoretical grounding.

HOW

  • Authors derive the Ψ-PO objective as a generalization of both RLHF and DPO

  • They then instantiate Ψ with the identity function and arrive at IPO (Identity Preference Optimization); the resulting loss is sketched below
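
Concretely, my reading of the paper's sampled IPO loss over preference pairs \((x, y_w, y_l)\) is a squared distance between a policy/reference log-ratio difference and the finite target \(\frac{1}{2\tau}\) (notation may differ slightly from the paper's):

\[
h_\pi(y_w, y_l, x) = \log \frac{\pi(y_w \mid x)\, \pi_{\mathrm{ref}}(y_l \mid x)}{\pi(y_l \mid x)\, \pi_{\mathrm{ref}}(y_w \mid x)},
\qquad
\mathcal{L}_{\mathrm{IPO}}(\pi) = \mathbb{E}\Big[ \big( h_\pi(y_w, y_l, x) - \tfrac{1}{2\tau} \big)^2 \Big]
\]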

CLAIMS/QUOTES

  • Why RLHF and DPO can overfit: they assume pairwise preferences can be replaced by pointwise rewards under a Bradley-Terry model: "This is due to the fact that [RLHF and DPO] rely on the strong assumption that pairwise preferences can be substituted with Elo-score (pointwise rewards) via a Bradley-Terry (BT) modelisation (Bradley and Terry, 1952)."

  • Why DPO Overfits: "DPO is prone to overfitting, and this stems from a combination of the unboundedness of Ψ, together with not training an explicit reward function"

  • KL-regularization weakens as preferences become more deterministic: "... the strength of the KL-regularisation becomes weaker and weaker the more deterministic the preferences"

  • IPO regularizes where DPO doesn't: "In other words IPO, unlike DPO, always regularizes its solution towards \(\pi_{\mathrm{ref}}\)" (a small code sketch after this list contrasts the two losses)
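
To make the regularization contrast concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper; the function name and framework choice are mine). Both losses depend on the same log-ratio difference \(h\); DPO's loss keeps decreasing as \(h \to \infty\) on (near-)deterministic preferences, while IPO pulls \(h\) toward the finite target \(1/(2\tau)\):

```python
import torch
import torch.nn.functional as F

def preference_losses(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """DPO vs. IPO losses on a batch of (chosen, rejected) preference pairs.

    logp_w, logp_l         -- log pi(y_w|x), log pi(y_l|x) under the trained policy
    ref_logp_w, ref_logp_l -- same quantities under the frozen reference policy
    tau                    -- regularization strength (the DPO paper calls it beta)
    """
    # Shared quantity: difference of policy/reference log-ratios.
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)

    # DPO: -log sigmoid(tau * h); the loss keeps shrinking as h grows,
    # so deterministic preferences can push pi arbitrarily far from pi_ref.
    dpo_loss = -F.logsigmoid(tau * h).mean()

    # IPO: squared distance of h to the finite target 1/(2*tau),
    # so the optimum stays a bounded distance from pi_ref.
    ipo_loss = ((h - 1.0 / (2.0 * tau)) ** 2).mean()

    return dpo_loss, ipo_loss

# Toy usage: placeholder log-probabilities for a batch of 4 preference pairs.
logp = -torch.rand(4, 4)
print(preference_losses(logp[0], logp[1], logp[2], logp[3]))
```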

EXTENDS/USES

  • DPO
  • RLHF

NOTES

  • Pairwise preference dataset still needed: IPO still needs a preference dataset with pairwise or ranked preferences, just like RLHF and DPO.

MY 2¢

  • Very nice theoretical finding, but the article could have included an empirical comparison between DPO, RLHF, and IPO in a controlled environment. (See references)

REFERENCES
