arXiv 2510.05024

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

By Nevan Wichers, Aram Ebtekar, et al.

Published 2025-10-06

Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired beha…
