arXiv 2508.14460

DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

By Shuaijie She, Yu Bao, et al.

Published 2025-08-20

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: the reliance of Reinforcement Learning with Verifiable Rewards (RLVR) on costly labels and its restriction to verifiable tasks, and the restriction of traditional dual learning to strictly dual task pairs (e.g., translation and back-translation…
