arXiv:2510.17431

Agentic Reinforcement Learning for Search is Unsafe

By Yushi Yang, Shreyansh Padarha, et al.

Published 2025-10-20

Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries.…
