arXiv 2307.15217
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
By Stephen Casper, Xander Davies, et al.
Published 2023-07-27
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) o…