arXiv 2404.12038

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

By Zhihao Xu, Ruixuan Huang, et al.

Published 2024-04-18

Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks…
