arXiv 2309.00267

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

By Harrison Lee, Samrat Phatale, et al.

Published 2023-09-01

Wiki summary

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue ge…
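As a rough illustration of what training a reward model on LLM-generated preferences can look like, the sketch below turns an AI labeler's two scores into a soft preference and scores a reward model against it with a pairwise (Bradley-Terry style) cross-entropy loss. This is an assumed, simplified formulation rather than the paper's implementation; the function names and the use of raw labeler logits as inputs are hypothetical.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_preference(logit_first: float, logit_second: float) -> float:
    """Probability that the AI labeler prefers the first response.

    Assumption: the labeler exposes one score (logit) per candidate, and a
    two-way softmax turns the pair into a preference probability.
    """
    m = max(logit_first, logit_second)  # subtract the max for stability
    e1 = math.exp(logit_first - m)
    e2 = math.exp(logit_second - m)
    return e1 / (e1 + e2)


def pairwise_rm_loss(reward_first: float, reward_second: float, p_first: float) -> float:
    """Cross-entropy between the labeler's preference and the reward model's
    predicted preference sigmoid(reward_first - reward_second)."""
    diff = reward_first - reward_second
    log_p_model_first = -_softplus(-diff)   # log sigmoid(diff)
    log_p_model_second = -_softplus(diff)   # log (1 - sigmoid(diff))
    return -(p_first * log_p_model_first + (1.0 - p_first) * log_p_model_second)


if __name__ == "__main__":
    # Hypothetical labeler scores for two candidate responses.
    p = soft_preference(1.2, -0.3)
    # Hypothetical reward-model outputs for the same two responses.
    print(p, pairwise_rm_loss(0.8, 0.1, p))
```

The loss shrinks as the reward model ranks the pair the same way the AI labeler does, which is the signal a reward model would be trained on before the RL stage.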
