arXiv 2512.01374
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
By Chujie Zheng, Kai Dang, et al.
Published 2025-12-01
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference di…
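As a rough illustration of what a token-level surrogate for a sequence-level reward can look like, the sketch below uses the textbook REINFORCE form: the sequence log-probability is the sum of per-token log-probabilities, so the loss `-R * sum_t log pi(y_t | x, y_<t)` has the same gradient as the sequence-level objective. This is not the paper's exact objective, and it does not model the conditions named in the abstract. The function names (`softmax`, `token_log_probs`, `reinforce_surrogate`) and the toy logits are hypothetical and chosen only for this example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one step's vocabulary logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_log_probs(step_logits, tokens):
    # log pi(y_t | x, y_<t) for each sampled token y_t.
    return [math.log(softmax(logits)[tok])
            for logits, tok in zip(step_logits, tokens)]

def reinforce_surrogate(step_logits, tokens, reward):
    # Token-level surrogate loss: the scalar sequence-level reward
    # multiplies the sum of per-token log-probabilities.
    return -reward * sum(token_log_probs(step_logits, tokens))

# Toy example (assumed values): 3 decoding steps, vocabulary of size 2,
# uniform logits, and a sequence-level reward of 1.0.
step_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
tokens = [0, 1, 0]
loss = reinforce_surrogate(step_logits, tokens, reward=1.0)
seq_prob = math.exp(sum(token_log_probs(step_logits, tokens)))
```

In an actual training loop the logits would come from the policy model and the loss would be differentiated by an autodiff framework; plain Python is used here only to keep the arithmetic visible.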