arXiv 2404.12038

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

By Zhihao Xu, Ruixuan Huang, et al.

Published 2024-04-18

Wiki summary

Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks…
