arXiv 2508.14460
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
By Shuaijie She, Yu Bao, et al.
Published 2025-08-20
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: the reliance of Reinforcement Learning with Verifiable Rewards (RLVR) on costly labels and its restriction to verifiable tasks, and the restriction of traditional dual learning to strictly dual task pairs (e.g., translation and back-translation…