arXiv 2404.12038
Uncovering Safety Risks of Large Language Models through Concept Activation Vector
By Zhihao Xu, Ruixuan Huang, et al.
Published 2024-04-18
Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks…