arXiv:2406.14144

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

By Jianhui Chen, Xiaozhi Wang, et al.

Published 2024-06-20

Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time act…
