arXiv 2501.18837

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

By Mrinank Sharma, Meg Tong, et al.

Published 2025-01-31

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural…
