arXiv 2404.14313

Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

By Jan-Philipp Fränken, Eric Zelikman, et al.

Published 2024-04-22

When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative alg…
