arXiv 2212.09251

Discovering Language Model Behaviors with Model-Written Evaluations

By Ethan Perez, Sam Ringer, et al.

Published 2022-12-19

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to writ…