arXiv 2307.15043
Universal and Transferable Adversarial Attacks on Aligned Language Models
By Andy Zou, Zifan Wang, et al.
Published 2023-07-27
Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, w…