arXiv 2411.11114
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
By Zeqing He, Zhibo Wang, et al.
Published 2024-11-17
Despite the outstanding performance of large language models (LLMs) in diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior (e.g., the deg…