arXiv 2501.13011
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
By Sebastian Farquhar, Vikrant Varma, et al.
Published 2025-01-22
Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Ap…