arXiv 2510.05024
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
By Nevan Wichers, Aram Ebtekar, et al.
Published 2025-10-06
Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired beha…