arXiv 2411.11114

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit

By Zeqing He, Zhibo Wang, et al.

Published 2024-11-17

Wiki summary

Despite their outstanding performance in diverse tasks, Large Language Models (LLMs) are vulnerable to jailbreak attacks, in which adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior (e.g., the deg…
