arXiv 2409.12917

Training Language Models to Self-Correct via Reinforcement Learning

By Aviral Kumar, Vincent Zhuang, et al.

Published 2024-09-19

Wiki summary

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that…
