arXiv 2510.05024

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

By Nevan Wichers, Aram Ebtekar, et al.

Published 2025-10-06

Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired beha…
