arXiv 2309.00267

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

By Harrison Lee, Samrat Phatale, et al.

Published 2023-09-01

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue ge…
