arXiv 2404.12038
Uncovering Safety Risks of Large Language Models through Concept Activation Vector
By Zhihao Xu, Ruixuan Huang, et al.
Published 2024-04-18
Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks…
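For readers unfamiliar with the term, a concept activation vector is generally understood as a direction in a model's activation space that separates examples of a concept from counterexamples, often taken as the normal of a linear classifier. The sketch below illustrates only that general idea on synthetic data. It is not the paper's SCAV method, whose details are not given in the excerpt above; the function name `fit_concept_vector`, the 16-dimensional activations, and the hand-rolled logistic-regression probe are all assumptions made for illustration.

```python
import numpy as np

def fit_concept_vector(pos, neg, lr=0.1, steps=500):
    """Fit a linear probe (hypothetical helper, not from the paper).

    pos, neg: arrays of shape (n, dim) holding activations for examples
    with and without the concept. Returns (unit_direction, train_accuracy).
    """
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        # Logistic regression trained by plain gradient descent.
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    acc = np.mean(((X @ w + b) > 0) == (y == 1))
    # The normalized weight vector plays the role of the concept direction.
    return w / np.linalg.norm(w), acc

# Synthetic stand-in for activations: the two classes differ along axis 0.
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
pos = rng.normal(size=(200, dim)) + 3.0 * true_dir
neg = rng.normal(size=(200, dim)) - 3.0 * true_dir

cav, acc = fit_concept_vector(pos, neg)
print("train accuracy:", acc)
print("cosine with true direction:", float(cav @ true_dir))
```

On real models the inputs would be hidden-state activations collected at a chosen layer rather than random draws, and the probe's accuracy gives a rough indication of how linearly separable the concept is at that layer.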