arXiv 2504.05259

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

By Tomek Korbak, Mikita Balesni, et al.

Published 2025-04-07

As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations…
