arXiv 2502.05209

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

By Zora Che, Stephen Casper, et al.

Published 2025-02-03

Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, this approach suffers from two limitations. First, input-output evaluations cannot fully evaluate realistic risks from open-weight models. Secon…
