arXiv 2309.00267

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

By Harrison Lee, Samrat Phatale, et al.

Published 2023-09-01

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue ge…
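The idea of training a reward model on AI-generated preferences can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: `ai_label` is a toy heuristic standing in for the off-the-shelf LLM labeler, the one-parameter `reward` function stands in for a real reward model, and the pairwise logistic (Bradley-Terry style) loss is a common choice for preference data that the excerpt above does not confirm.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def n_words(text):
    return len(text.split())


def ai_label(response_a, response_b):
    # Hypothetical stand-in for an LLM labeler: returns the probability
    # that response_a is preferred. This toy rule simply prefers the
    # shorter response; a real labeler would prompt an LLM instead.
    return 1.0 if n_words(response_a) < n_words(response_b) else 0.0


def reward(w, text):
    # Illustrative one-parameter reward model: score = w * word count.
    return w * n_words(text)


def preference_loss(w, response_a, response_b, p_a):
    # Cross-entropy between the AI label p_a and the model's predicted
    # probability that response_a beats response_b.
    p_model = sigmoid(reward(w, response_a) - reward(w, response_b))
    eps = 1e-12  # guards against log(0)
    return -(p_a * math.log(p_model + eps)
             + (1.0 - p_a) * math.log(1.0 - p_model + eps))


def train_reward_model(pairs, lr=0.01, steps=200):
    # Plain gradient descent on the preference loss, using AI labels only.
    w = 0.0
    for _ in range(steps):
        for a, b in pairs:
            p_a = ai_label(a, b)
            p_model = sigmoid(reward(w, a) - reward(w, b))
            # d(loss)/dw = (p_model - p_a) * (feature_a - feature_b)
            grad = (p_model - p_a) * (n_words(a) - n_words(b))
            w -= lr * grad
    return w
```

Under these assumptions, the trained weight ends up negative, meaning the reward model has absorbed the labeler's preference for shorter responses without any human labels. In an RLAIF-style pipeline the learned reward would then drive a reinforcement learning step, which this sketch omits.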
