arXiv 2502.05209
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
By Zora Che, Stephen Casper, et al.
Published 2025-02-03
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, this approach suffers from two limitations. First, input-output evaluations cannot fully evaluate realistic risks from open-weight models. Secon…