arXiv 2307.15043

Universal and Transferable Adversarial Attacks on Aligned Language Models

By Andy Zou, Zifan Wang, et al.

Published 2023-07-27

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, w…
