arXiv 2411.11114
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
By Zeqing He, Zhibo Wang, et al.
Published 2024-11-17
Despite the outstanding performance of Large Language Models (LLMs) in diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior (e.g., the deg…