arXiv 2505.08054

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

By Zhehao Zhang, Weijie Xu, et al.

Published 2025-05-12

Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent int…
