arXiv 2308.10248

Steering Language Models With Activation Engineering

By Alexander Matt Turner, Lisa Thiergart, et al.

Published 2023-08-20

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which con…
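As a rough illustration of what "inference-time modification of activations" can mean, the sketch below shifts the hidden activations of a toy two-layer network by a fixed vector while leaving the weights untouched. It is not the paper's ActAdd procedure, whose description is cut off above. The network, the weights, and the names `forward`, `steering`, and `coeff` are all hypothetical and chosen only for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(x, w_in, w_out, steering=None, coeff=1.0):
    """Toy two-layer network; optionally shifts the hidden activations.

    `steering` and `coeff` are illustrative assumptions: the vector is
    added to the hidden layer at inference time, and no weight is changed.
    """
    hidden = [math.tanh(h) for h in matvec(w_in, x)]
    if steering is not None:
        hidden = [h + coeff * s for h, s in zip(hidden, steering)]
    return matvec(w_out, hidden)


# Arbitrary toy weights: 2 inputs -> 3 hidden units -> 2 outputs.
W_IN = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
W_OUT = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]

x = [1.0, 2.0]
steer = [0.0, 0.0, 1.0]  # hypothetical direction in hidden space

baseline = forward(x, W_IN, W_OUT)
steered = forward(x, W_IN, W_OUT, steering=steer, coeff=2.0)
shift = [s - b for s, b in zip(steered, baseline)]
```

Because the output layer here is linear, `shift` equals `W_OUT` applied to `coeff * steer`: the same input yields a different output purely through the activation edit. In a real language model the edited activations would pass through further nonlinear layers, so the effect would not be this simple.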