arXiv 2501.18837
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
By Mrinank Sharma, Meg Tong, et al.
Published 2025-01-31
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural…