arXiv 2501.13011
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
By Sebastian Farquhar, Vikrant Varma, et al.
Published 2025-01-22
Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Ap…