arXiv 2406.14144

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

By Jianhui Chen, Xiaozhi Wang, et al.

Published 2024-06-20

Wiki summary

Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time act…
