arXiv 2209.07858

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

By Deep Ganguli, Liane Lovitt, et al.

Published 2022-08-23

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with reje…
