arXiv 2502.05209
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
By Zora Che, Stephen Casper, et al.
Published 2025-02-03
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, this approach suffers from two limitations. First, input-output evaluations cannot fully evaluate realistic risks from open-weight models. Secon…