arXiv 2502.03095
Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms
By Xuerui Su, Yue Wang, et al.
Published 2025-02-05
With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignmen…