arXiv 2502.03095

Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms

By Xuerui Su, Yue Wang, et al.

Published 2025-02-05

With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignmen…
