arXiv 2502.03095

Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms

By Xuerui Su, Yue Wang, et al.

Published 2025-02-05

With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignmen…
