arXiv 2409.12917
Training Language Models to Self-Correct via Reinforcement Learning
By Aviral Kumar, Vincent Zhuang, et al.
Published 2024-09-19
Wiki summary
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that…