arXiv 2307.15217

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

By Stephen Casper, Xander Davies, et al.

Published 2023-07-27

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) o…
