arXiv 2309.00267
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
By Harrison Lee, Samrat Phatale, et al.
Published 2023-09-01
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue ge…
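The reward-model step mentioned above (training the RM on preferences generated by an off-the-shelf LLM) can be illustrated with a minimal sketch. It assumes a Bradley-Terry style pairwise objective and a soft AI label, i.e. a probability that response A is preferred over response B. Neither detail is stated in the text above; the function names `softplus` and `preference_loss` and the parameter `p_a_preferred` are hypothetical, chosen only for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(reward_a: float, reward_b: float, p_a_preferred: float) -> float:
    """Cross-entropy between an AI-generated preference label and the
    reward model's predicted preference (assumed Bradley-Terry form).

    reward_a, reward_b: scalar RM scores for two candidate responses.
    p_a_preferred: AI labeler's probability that A is better (assumption:
                   a soft label in [0, 1]; a hard label is just 0.0 or 1.0).
    """
    diff = reward_a - reward_b
    # Under Bradley-Terry, P(A preferred) = sigmoid(diff), so
    # -log P(A preferred) = softplus(-diff) and -log P(B preferred) = softplus(diff).
    return p_a_preferred * softplus(-diff) + (1.0 - p_a_preferred) * softplus(diff)


if __name__ == "__main__":
    # The loss is lower when the RM agrees with the AI label than when it disagrees.
    print(preference_loss(2.0, 0.0, 1.0))
    print(preference_loss(0.0, 2.0, 1.0))
```

In this sketch the only difference from a human-feedback setup is where `p_a_preferred` comes from: an LLM labeler rather than a human annotator.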