arXiv:2504.05259
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
By Tomek Korbak, Mikita Balesni, et al.
Published 2025-04-07
As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations…