arXiv 2404.12038

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

By Zhihao Xu, Ruixuan Huang, et al.

Published 2024-04-18

Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks…
