arXiv 2502.03095
Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms
By Xuerui Su, Yue Wang, et al.
Published 2025-02-05
With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignmen…