arXiv 2401.10020

Self-Rewarding Language Models

By Weizhe Yuan, Richard Yuanzhe Pang, et al.

Published 2024-01-18

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models,…
