arXiv 2411.11114
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
By Zeqing He, Zhibo Wang, et al.
Published 2024-11-17
Despite the outstanding performance of Large Language Models (LLMs) on diverse tasks, they are vulnerable to jailbreak attacks, in which adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, their underlying mechanisms remain poorly understood. Recent studies have explained typical jailbreaking behavior (e.g., the deg…