arXiv 2510.01088

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

By Guobin Shen, Dongcheng Zhao, et al.

Published 2025-10-01

Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous cont…
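The entropy contrast described in the abstract can be made concrete with a small sketch. This is not the paper's implementation: the helper names `token_entropy` and `mean_entropy` and the toy probability values are assumptions chosen for illustration. It only shows how the average per-token Shannon entropy of a generation could be computed from next-token distributions and compared between a confident response and an uncertain one.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions):
    """Average per-token entropy over all decoding steps of one response."""
    if not step_distributions:
        raise ValueError("need at least one decoding step")
    total = sum(token_entropy(p) for p in step_distributions)
    return total / len(step_distributions)


# Toy next-token distributions (hypothetical values, not data from the paper).
# Each inner list is the probability distribution at one decoding step.
confident_response = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
uncertain_response = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

low = mean_entropy(confident_response)
high = mean_entropy(uncertain_response)
print(f"confident: {low:.3f} nats, uncertain: {high:.3f} nats")
```

In practice the distributions would come from a model's softmaxed logits at each generated position; here they are hard-coded so the sketch runs without any model or library dependency.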
