arXiv 2512.01374

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

By Chujie Zheng, Kai Dang, et al.

Published 2025-12-01


This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference di…
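As a rough illustration of the sequence-level versus token-level relationship mentioned above, the following minimal sketch covers only the simplest case: a toy policy sampled fully on-policy. It is not the paper's formulation. The tabular per-position policy, the reward `toy_reward`, and all function names are hypothetical choices made for this example. The sketch enumerates every sequence, computes the expected gradient of the token-level surrogate `sum_t log pi_t(y_t) * R(y)`, and compares it with a finite-difference gradient of the true expected sequence-level reward.

```python
import itertools
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def seq_prob(theta, seq):
    """Probability of a sequence under a toy policy with one logit row per position.

    Assumption for illustration: token probabilities do not depend on the prefix.
    """
    p = 1.0
    for t, tok in enumerate(seq):
        p *= softmax(theta[t])[tok]
    return p


def all_sequences(theta):
    vocab = len(theta[0])
    return itertools.product(range(vocab), repeat=len(theta))


def expected_reward(theta, reward):
    """True sequence-level objective J(theta) = sum_y pi(y) * R(y)."""
    return sum(seq_prob(theta, s) * reward(s) for s in all_sequences(theta))


def surrogate_grad(theta, reward):
    """Expected gradient of the token-level surrogate sum_t log pi_t(y_t) * R(y).

    The expectation over y ~ pi is computed exactly by enumeration (on-policy).
    """
    vocab = len(theta[0])
    grad = [[0.0] * vocab for _ in theta]
    for s in all_sequences(theta):
        weight = seq_prob(theta, s) * reward(s)
        for t, tok in enumerate(s):
            probs = softmax(theta[t])
            for k in range(vocab):
                # d/d theta[t][k] of log softmax(theta[t])[tok]
                grad[t][k] += weight * ((1.0 if k == tok else 0.0) - probs[k])
    return grad


def finite_diff_grad(theta, reward, eps=1e-6):
    """Central finite-difference gradient of the true objective J(theta)."""
    grad = [[0.0] * len(row) for row in theta]
    for t in range(len(theta)):
        for k in range(len(theta[t])):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[t][k] += eps
            minus[t][k] -= eps
            grad[t][k] = (expected_reward(plus, reward)
                          - expected_reward(minus, reward)) / (2 * eps)
    return grad


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def toy_reward(seq):
    """Hypothetical sequence-level reward: 1 if exactly one token equals 1."""
    return 1.0 if sum(seq) == 1 else 0.0


if __name__ == "__main__":
    theta = [[0.2, -0.4], [1.0, 0.3], [-0.5, 0.1]]
    diff = max_abs_diff(surrogate_grad(theta, toy_reward),
                        finite_diff_grad(theta, toy_reward))
    print("max |surrogate grad - true grad| =", diff)
```

In this on-policy toy setting the two gradients agree up to finite-difference error, because the log-probability of a sequence is the sum of its token log-probabilities. The sketch does not model the sources of mismatch between sampling and training policies, which is where an approximation argument like the one in the abstract would come into play.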
