arXiv 2308.10248
Steering Language Models With Activation Engineering
By Alexander Matt Turner, Lisa Thiergart, et al.
Published 2023-08-20
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which con…
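As a rough illustration of what "inference-time modification of activations" can mean in general, the sketch below adds a fixed vector to the hidden activations of a tiny hand-built network while leaving the weights untouched. Everything in it is an assumption made for illustration: the two-layer toy model, the `forward` function, the `steering` vector, and the `coeff` scale are hypothetical and are not taken from the paper, and the sketch does not attempt to reproduce ActAdd itself.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def relu(vec: Vector) -> Vector:
    """Elementwise ReLU."""
    return [max(0.0, x) for x in vec]


def forward(
    x: Vector,
    w1: Matrix,
    w2: Matrix,
    steering: Optional[Vector] = None,  # hypothetical steering vector
    coeff: float = 1.0,                 # hypothetical scaling coefficient
) -> Vector:
    """Toy two-layer forward pass with an optional activation edit.

    The weights w1 and w2 are never changed; only the intermediate
    activations are shifted at inference time.
    """
    hidden = relu(matvec(w1, x))
    if steering is not None:
        # Shift the hidden activations by coeff * steering.
        hidden = [h + coeff * s for h, s in zip(hidden, steering)]
    return matvec(w2, hidden)


# Toy weights, chosen only to keep the arithmetic easy to follow.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, -1.0]]
x = [2.0, 1.0]

baseline = forward(x, W1, W2)
steered = forward(x, W1, W2, steering=[0.0, 1.0], coeff=3.0)

print("baseline:", baseline)
print("steered: ", steered)
```

The same input and the same weights give a different output once the hidden activations are shifted, which is the general sense in which an activation edit can steer a model without finetuning. In a real language model the edit would typically be applied to a chosen layer's activations through a framework-specific hook, which this toy example does not model.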