arXiv 2501.18837

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

By Mrinank Sharma, Meg Tong, et al.

Published 2025-01-31

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural…
