arXiv 2404.14313
Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
By Jan-Philipp Fränken, Eric Zelikman, et al.
Published 2024-04-22
When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative alg…