arXiv 2209.07858
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
By Deep Ganguli, Liane Lovitt, et al.
Published 2022-08-23
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with reje…